smartie.py: CCP4 logfile parser tools
Usage documentation for version 0.0.15
Introduction
smartie is a set of Python classes and methods intended
to provide tools for parsing the content of CCP4 logfiles. The name
"smartie" reflects its origins as the driver for a "smart logfile
browser", although this aim has not yet been realised.
The logfile class lies at the heart of smartie. Once
populated from a file, a logfile object gives a high-level view of
that file in terms of "components" (CCP4i comments, individual
program logs, tables, warnings, summaries and so on). The logfile
class was loosely inspired by the JavaScript DOM (document object model)
for describing hypertext documents, although it is far more limited
at present.
To see an example of smartie parsing a logfile, feed it one of
the example logfiles included in the distribution, e.g.:
% python smartie.py 7_refmac5.log
or all the example log files:
% python smartie.py *.log | more
or try it on a file of your own.
Module documentation
The documentation for the classes and functions in the smartie
module (generated using pydoc) is in smartie.html.
There are also overviews of the different
classes below.
Usage examples
1. Interrogating logfiles
To create a new logfile object describing (say) a CCP4i logfile
from a scala job, we use smartie's parselog
function:
>>> import smartie
>>> logfile = smartie.parselog("22_scala.log")
We can then find out for example how many "fragments" smartie
thought it had found in this file:
>>> logfile.nfragments()
8
A fragment is any chunk of the logfile that smartie
recognised. We can interrogate the type of a fragment, for example:
>>> logfile.fragment(0).isccp4i_info()
True
If we're only interested in the fragments that looked like individual
program output then we can also find out how many programs it thought
it had found, by querying its list of program logs:
>>> logfile.nprograms()
7
We can ask it some questions about individual programs, for
example:
>>> logfile.program(1).isccp4()
True
>>> logfile.program(1).name
'Scala'
>>> logfile.program(1).version
'5.99'
>>> logfile.program(1).termination_message
'** Normal termination **'
It is also possible to get a list of the keyword input lines
for each program logfile, for example:
>>> logfile.program(0).name
'SORTMTZ'
>>> logfile.program(0).keywords()
['ASCEND', 'H K L M/ISYM BATCH']
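To give a feel for what keyword extraction involves, here is a minimal standalone sketch (Python 3) that pulls keyword lines out of raw log text. It assumes the " Data line---" preamble that parselog() is documented to recognise; the function name extract_keywords and the sample log text are hypothetical, and this is not smartie's own implementation.

```python
# Hypothetical sketch: collect keyword input lines from raw logfile text.
# Assumption: each keyword line starts with a "Data line---" preamble.
PREAMBLE = "Data line---"


def extract_keywords(log_text):
    """Return the keyword text that follows each 'Data line---' preamble."""
    keywords = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(PREAMBLE):
            # Keep only the text after the preamble, without padding
            keywords.append(stripped[len(PREAMBLE):].strip())
    return keywords


# Made-up log text for illustration only
sample_log = """\
 Data line--- ASCEND
 Data line--- H K L M/ISYM BATCH
 Some other program output
"""

print(extract_keywords(sample_log))
# prints ['ASCEND', 'H K L M/ISYM BATCH']
```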
For each program any "logical name/filename" pairs found in
the logfile are also stored and can be retrieved. To get a list
of the logical names associated with files that were opened:
>>> logfile.program(1).logicalnames()
['HKLIN', 'HKLOUT', 'SYMINFO']
Then, to find out what the associated file is for a particular
logical name:
>>> logfile.program(1).logicalnamefile("HKLIN")
'/home/pjx/PROJECTS/myProject/aucn_sorted.mtz'
>>> logfile.program(1).logicalnamefile("HKLOUT")
'/tmp/pjx/PROJECT_22_2_mtz.tmp'
The logfile class offers similar methods for fragments which
are not CCP4 program output but are (for example) messages from
CCP4i. Smartie also has a summarise function which will
print a report of the logfile contents:
>>> logfile = smartie.parselog("22_scala.log")
>>> smartie.summarise(logfile)
Summary for 22_scala.log
This is a CCP4i logfile
8 logfile fragments
Fragments:
CCP4i info
Program: SORTMTZ
Program: Scala
Program: MTZDUMP
Program: UNIQUE
Program: FREERFLAG
Program: CAD
Program: FREERFLAG
7 program logfiles
Programs:
SORTMTZ v5.99 (CCP4 5.99)
Scala v5.99 (CCP4 5.99)
Tables:
Table: ">>> Scales v rotation range, red_aucn"
Table: "Analysis against Batch, red_aucn"
Table: "Analysis against resolution , red_aucn"
...
The fragments and fragment-derived objects (programs and ccp4i_info)
also allow the text of the fragment to be retrieved from the logfile.
For example:
>>> logfile = smartie.parselog("22_scala.log")
>>> prog = logfile.program(0)
>>> print prog.retrieve()
###############################################################
###############################################################
###############################################################
### CCP4 5.99: SORTMTZ version 5.99 : 06/09/05##
###############################################################
User: pjx Run date: 31/ 1/2006 Run time: 16:02:55
...
2. Extracting summaries from marked up logfiles
As of version 0.0.8, logfile objects store information about blocks
of "summary" text that are found in the source logfile. A summary block
is a section of logfile output that is enclosed within
<!--SUMMARY_BEGIN--> and <!--SUMMARY_END--> tags, for
example:
<B><FONT COLOR="#FF0000"><!--SUMMARY_BEGIN-->
================================================================================
Summary data for Project: DMSO Crystal: DMSO Dataset: red_aucn
Overall OuterShell
Low resolution limit 35.27 3.16
High resolution limit 3.00 3.00
...
<!--SUMMARY_END--></FONT></B>
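The tag-based layout above can be located with a simple pattern match. The following standalone sketch (Python 3) is an illustration only, not smartie's implementation: the function name find_summary_blocks and the sample text are hypothetical, and unlike smartie's summary objects it returns only the enclosed text rather than start and end line numbers.

```python
import re

# Non-greedy match of everything between a pair of summary tags;
# DOTALL lets "." span line breaks.
SUMMARY_BLOCK = re.compile(
    r"<!--SUMMARY_BEGIN-->(.*?)<!--SUMMARY_END-->", re.DOTALL
)


def find_summary_blocks(log_text):
    """Return the text enclosed by each pair of summary tags."""
    return [m.group(1).strip() for m in SUMMARY_BLOCK.finditer(log_text)]


# Made-up log text for illustration only
sample_log = """\
Some program output
<B><FONT COLOR="#FF0000"><!--SUMMARY_BEGIN-->
Low resolution limit 35.27 3.16
High resolution limit 3.00 3.00
<!--SUMMARY_END--></FONT></B>
More program output
"""

blocks = find_summary_blocks(sample_log)
print(len(blocks))   # prints 1
print(blocks[0])
```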
There are a number of methods available to interrogate the summary block
information. For example, to find out how many summary blocks a logfile
contains:
>>> logfile = smartie.parselog("22_scala.log")
>>> logfile.nsummaries()
30
i.e. the logfile holds 30 summary blocks. For each summary block the
start and end lines in the source logfile can be retrieved, as can the
actual text. For example, to get information on the 13th summary block in
a log file:
>>> logfile.summary(12).start()
853
>>> logfile.summary(12).end()
855
>>> print logfile.summary(12).retrieve()
<B><FONT COLOR="#FF0000"><!--SUMMARY_BEGIN-->
Logical name: CORRELPLOT, Filename: /home/pjx/PROJECTS/myProject/PROJECT_22_correlplot.xmgr
<!--SUMMARY_END--></FONT></B>
It's not clear that this functionality is particularly useful. One
application is to write out all the summaries in one go e.g.:
>>> for i in range(0,logfile.nsummaries()):
... print logfile.summary(i).retrieve()
...
(In this last example, Smartie's strip_logfile_html() function
could also be used to remove any HTML tags in the output, and to escape any
HTML special characters, in order to make the summary output easier to
read.)
This functionality is provided in the
show_summary.py example script.
3. Working with tables and graphs
Once a logfile object has been constructed from a file, smartie
offers various ways to find out about the tables associated with
the file overall, and with individual programs and fragments.
We can ask it about tables that it found for an individual
fragment or program:
>>> logfile.fragment(2).ntables()
7
>>> logfile.program(1).tables()[3].title()
'Analysis against intensity, red_aucn'
>>> logfile.program(1).tables()[4].ngraphs()
4
>>> logfile.program(1).tables()[4].table_graph(0).title()
'Completeness v Resolution '
>>> logfile.program(1).tables()[4].nrows()
10
We can also ask it similar questions about tables in the logfile
as a whole, for example:
>>> logfile.ntables()
7
We can also fetch a table in a logfile or a program by specifying
a regular expression pattern that matches the table title, for
example:
>>> logfile = smartie.parselog("7_refmac5.log")
>>> logfile.tables("Rfactor analysis, stats vs cycle")[0].title()
'Rfactor analysis, stats vs cycle'
>>> logfile.program(0).tables("Cycle 11")[0].title()
'Cycle 11. Rfactor analysis, F distribution v resln'
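The idea of selecting tables by a regular expression on the title can be sketched independently of smartie (Python 3). The helper match_titles is hypothetical, the "Cycle 1." title is invented for illustration, and the use of re.search (match anywhere in the title) is an assumption about the matching rule rather than a statement of how tables() works internally.

```python
import re


def match_titles(titles, pattern):
    """Return the titles in which the regular expression is found."""
    return [t for t in titles if re.search(pattern, t)]


titles = [
    "Rfactor analysis, stats vs cycle",
    "Cycle 1. Rfactor analysis, F distribution v resln",   # invented title
    "Cycle 11. Rfactor analysis, F distribution v resln",
]

print(match_titles(titles, "Cycle 11"))
# prints ['Cycle 11. Rfactor analysis, F distribution v resln']

# "Cycle 1" is also found inside "Cycle 11", so two titles match
print(len(match_titles(titles, "Cycle 1")))   # prints 2

# Escaping the full stop narrows the match to the first cycle only
print(match_titles(titles, r"Cycle 1\."))
# prints ['Cycle 1. Rfactor analysis, F distribution v resln']
```

Because a pattern can match more than one title, it is worth checking how many tables come back before indexing into the result.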
For a particular table we can get the values for a particular
column:
>>> tbl = logfile.tables()[6]
>>> tbl.col("Rfree")
['0.178', '0.196', '0.204', '0.210', '0.215', '0.221', '0.222', '0.225',
'0.227', '0.228', '0.229']
>>> tbl.col("Rfree")[-1]
'0.229'
4. Using the table class to create tables and graphs
smartie's table, table_graph and table_column classes are intended to
be useful not only for reading tables from logfiles, but also for
constructing and writing them.
In outline the steps involved are:
- Create a new table e.g.
tbl = smartie.table(title)
- Define a set of columns e.g.
tbl.addcolumn(column_name)
- Add data to the table "row-wise" e.g.
tbl.add_data(dictionary)
- (Optionally) add graph definitions e.g.
tbl.definegraph(title,column_list)
Graph definitions are required in order to generate a $TABLE marked-up
loggraph table using the table.loggraph() and related methods.
An example of creating and populating a new table can be found in
the smartie.table_example() function. Alternatively:
>>> tbl = smartie.table("A table with random data")
>>> for i in range(0,3):
... col = tbl.addcolumn("col_"+str(i))
...
>>> for j in range(0,6):
... tbl.add_data({"col_0":j,"col_1":j*2,"col_2":j*3})
...
>>> tbl.definegraph("An arbitrary graph",("col_0","col_1"))
>>> tbl.definegraph("Another arbitrary graph",("col_0","col_2"))
>>> print tbl.loggraph()
$TABLE: A table with random data:
$GRAPHS
:An arbitrary graph:A:1,2:
:Another arbitrary graph:A:1,3:
$$
col_0 col_1 col_2 $$ $$
0 0 0
1 2 3
2 4 6
3 6 9
4 8 12
5 10 15
$$
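To make the layout of the loggraph text above more concrete, here is a standalone sketch (Python 3) that produces the same structure from plain lists. It is not smartie's code: the function format_loggraph and its arguments are hypothetical, column spacing is reduced to single spaces, and the "A" field in each graph definition is simply copied from the output shown above.

```python
def format_loggraph(title, columns, rows, graphs):
    """Build $TABLE marked-up text.

    columns: list of column names
    rows:    list of rows, each a list of values in column order
    graphs:  list of (graph title, list of column names) pairs
    """
    lines = ["$TABLE: %s:" % title, "$GRAPHS"]
    for graph_title, graph_cols in graphs:
        # Graph definitions refer to columns by 1-based position
        positions = ",".join(str(columns.index(c) + 1) for c in graph_cols)
        lines.append(":%s:A:%s:" % (graph_title, positions))
    lines.append("$$")
    lines.append(" ".join(columns) + " $$ $$")
    for row in rows:
        lines.append(" ".join(str(value) for value in row))
    lines.append("$$")
    return "\n".join(lines)


columns = ["col_0", "col_1", "col_2"]
rows = [[j, j * 2, j * 3] for j in range(6)]
graphs = [
    ("An arbitrary graph", ["col_0", "col_1"]),
    ("Another arbitrary graph", ["col_0", "col_2"]),
]
text = format_loggraph("A table with random data", columns, rows, graphs)
print(text)
```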
If you want the graph to be written with the JLogGraph applet markup also
included then you can instead use:
>>> print tbl.jloggraph()
<applet width="400" height="300" code="JLogGraph.class"
codebase=""><param name="table" value="
$TABLE: A table with random data:
$GRAPHS
:An arbitrary graph:A:1,2:
:Another arbitrary graph:A:1,3:
$$
col_0 col_1 col_2 $$ $$
0 0 0
1 2 3
2 4 6
3 6 9
4 8 12
5 10 15
$$"><b>For inline graphs use a Java browser</b></applet>
Alternatively, the show method will just return the table body
(column titles plus data) as a block of text without any additional markup,
and the html method will return a similar table formatted with the
appropriate HTML tags (and with any special characters converted to their
HTML equivalents for correct display).
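The escaping step matters because column titles and values can contain characters such as "<", ">" and "&". The sketch below (Python 3, using the standard library's html.escape) shows the general idea; the function html_table, the sample column names and the exact tags emitted are assumptions for illustration and are not taken from smartie's html method.

```python
from html import escape


def html_table(columns, rows):
    """Return a simple HTML table with special characters escaped."""
    lines = ["<table>"]
    header = "".join("<th>%s</th>" % escape(str(c)) for c in columns)
    lines.append("<tr>%s</tr>" % header)
    for row in rows:
        cells = "".join("<td>%s</td>" % escape(str(v)) for v in row)
        lines.append("<tr>%s</tr>" % cells)
    lines.append("</table>")
    return "\n".join(lines)


# Made-up column names and values for illustration only
out = html_table(["Resolution", "I/sigI > 2"], [[3.0, 12.5]])
print(out)
```

Here the ">" in the second column title is written as "&gt;" so that a browser displays it as text rather than treating it as markup.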
Overview of Smartie classes
Smartie offers the following principal classes:
- the logfile class, which gives a high-level description
of a logfile made up of smaller fragments (currently called
"programs" within the context of smartie, although not all of
them are actually program logfiles)
- the fragment class, which describes a generic fragment of
a logfile, and additional program and ccp4i_info
classes that are derived from it, and which typically describe the
log from a single CCP4 program and messages from CCP4i respectively.
- the table class, which describes a CCP4 formatted table
- the summary class, which describes the location of a block
of summary text (enclosed within CCP4 summary tags) within a
logfile
There are a number of additional classes to support these:
- the table_graph and table_column classes provide a way to
interact with the components of tables
- the keytext class describes warnings from CCP4 programs
that are typically found embedded in program logfiles.
Finally there are also some classes which are primarily intended for
use internally to smartie:
- the buffer and tablebuffer classes
A logfile object is populated and returned by the
parselog() function. This takes a file name as a single
compulsory argument; the optional "progress" argument specifies
a number of lines at which to report progress when parsing the
file. It can recognise the following features in a logfile:
- CCP4 program banners (both standard and "phaser-style")
- CCP4 program terminations (both standard and "phaser-style")
- CCP4-formatted tables
- CCP4-formatted warnings from CCPERR
- CCP4i logfile "head" and "tail" (termination)
- CCP4i information messages
- CCP4 "summary" tags enclosing blocks of summary text
parselog reads a logfile and returns a logfile object based
on the file contents. A logfile object holds lists of fragments,
programs, tables, keytext messages,
CCP4i information messages and summary blocks.
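As an illustration of the kind of pattern matching involved in recognising a standard program banner, the standalone sketch below (Python 3) parses a banner line like the one shown in the retrieve() example earlier. The regular expression, the function parse_banner and the group names are hypothetical and are not the patterns smartie itself uses; the optional trailing lower-case letter on the CCP4 version follows the 6.0.99a case mentioned in the change log.

```python
import re

# Hypothetical pattern for a standard banner line such as
# "### CCP4 5.99: SORTMTZ version 5.99 : 06/09/05##"
BANNER = re.compile(
    r"###\s+CCP4\s+(?P<ccp4_version>[0-9.]+[a-z]?):\s+"
    r"(?P<name>\S+)\s+version\s+(?P<version>\S+)\s*:\s*"
    r"(?P<date>\d\d/\d\d/\d\d)\s*##"
)


def parse_banner(line):
    """Return a dict of banner fields, or None if the line is not a banner."""
    match = BANNER.search(line)
    if match is None:
        return None
    return match.groupdict()


info = parse_banner("### CCP4 5.99: SORTMTZ version 5.99 : 06/09/05##")
print(info["name"], info["version"], info["ccp4_version"])
# prints SORTMTZ 5.99 5.99

print(parse_banner("User: pjx Run date: 31/ 1/2006 Run time: 16:02:55"))
# prints None
```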
Applications using Smartie
Smartie is currently used in three applications:
- The MrBUMP automated
molecular replacement program uses Smartie's table extraction and
manipulation functionality to help in processing output from
some of the underlying programs
- The "baubles" program uses smartie to analyse a CCP4 logfile
before reformatting it in a mixture of HTML, CSS and JavaScript
for display in a web browser.
- The "starKey" program developed as part of CCP4's contribution to
the BIOXHIT project uses Smartie to gather information about which
programs were run by a particular CCP4i task - see
CCP4-BIOXHIT:
Known issues/to-do list
- When constructing tables there is no way to validate them (e.g.
checking that each column has the correct number of rows)
before writing them out
- The API for the table and related classes needs a review and
overhaul in the context of actual usage
- The fragment class should have better methods to query what type
of fragment it is (program, table etc).
- Deal properly with HTML and SUMMARY markup of CCP4 logfiles -
for example, by taking this into account when looking for start
and end
- Improvements to the CCP4i header, tail and messages - most
likely make these into dedicated subclasses of the fragment
class.
Change log
Changes in 0.0.15
- buffer class now supports len(buffer) for retrieving
the number of lines stored.
- Table parsing errors: parselog reports filename and location of
tables that cannot be properly parsed, and warning from
table.__populate_columns() has been updated to be less cryptic.
Changes in 0.0.14
- table.show() has new optional argument pad_columns,
which controls whether data items in table columns are padded with
spaces in order to align them. table.loggraph() also
supports this argument.
- strip_logfile_html tries to salvage spacegroup names that have
been written in the form <P 41>, using new function
tag_is_spacegroup.
- buffer class has a new all() method: this returns
the whole buffer as a string, so can be used instead of the tail
method for buffer objects which do not have a fixed size.
- parselog now creates table objects for any table-like
feature that is found, even if the table cannot subsequently be parsed.
The raw text of a table is always available using the
table.rawtext() method, even if the table is not parsed
correctly, and diagnostic warnings to stdout have been reduced to a
single line.
- parselog: the limit of 1000 lines has been removed from the
tablebuffer, so arbitrarily large tables can now be processed (within
system limits).
Changes in 0.0.13
- Bug fix: fix len(fragment) function so that it always
returns zero or greater.
Changes in 0.0.12
- Bug fix: patternmatch.isccp4banner() has been updated to recognise
CCP4 version numbers that include a trailing lower-case letter
e.g. 6.0.99a. Previously programs with these version numbers
did not have their banners identified by smartie.
- The number of lines in the logfile that belong to a fragment (or
program) object can be obtained by using the len function,
e.g. len(fragment).
Changes in 0.0.11
- logfile class: the "source" log file name is now stored as an
absolute path.
- Updates to the smartie directory structure: example logfiles are
now in the "logfiles" subdirectory.
- "test" subdirectory contains a basic set of unit tests for some
of the pattern matching functions in smartie - it can be run
using python test/test_smartie.py.
Changes in 0.0.10
- show_summary.py: example script that uses smartie to process a
logfile and then prints the text enclosed in summary tags.
- $TABLE recognition failed for some examples where the table
title contains ":" characters before the end of the title string.
In this case smartie would complain that it couldn't process the
table - now it applies a second matching step with fewer
constraints before giving up.
- table.jloggraph(): now escapes special HTML characters in table
and graph titles when generating applet code, including double
quotes " - otherwise these can cause problems with
JLogGraph's parsing of the result table.
- escape_xml_characters(): now also escapes double quotes and
replaces them with &quot;. Updated to use "replace" rather
than regular expressions (Kevin Cowtan).
- parselog: buffer size increased to 50 lines, which means that
features of up to 50 lines can be recognised (used to be
10) (Kevin Cowtan).
- patternmatch class: methods for detecting program banners and
termination messages now have an additional fast test which
skips the full regular expression tests if not passed - this
has resulted in a significant speed-up (Kevin Cowtan).
- retrieve functions: these have been modified and should be faster
now for big files. Previously they were slow due to sub-optimal
usage of the linecache module.
- table.show(): always printed tables as $GRAPHS, even if they were
originally $SCATTER plots - now fixed.
- Bug fix: table_column.append() always converted numerical values
to integers, regardless of whether they were actually integers or
floats. Now floats should be properly treated.
- table_graph.graphcols(): new method that returns a list of the
column names that make up the graph.
- table class: new methods to make it easier to build tables
from scratch, specifically:
table.list_columns(): return a list of the column names defined
in the table
table.add_data(): add a "row" of data to the table
table.definegraph(): add a new graph definition to the table
based on the existing columns.
The table_example() function has also been updated to show how
these new methods can be used to make a new table.
Changes in 0.0.9
- tokenise() function: now recognises either single or double
quotes as token delimiters (and quotes of one type can contain
quotes of the other type).
Changes in 0.0.8
- Minor modification to parselog, to set the start line for
"unknown" program fragments to be immediately after the end of
the previous fragment (relevant if you are using the retrieve()
methods of programs and fragments).
- Bug fix: escape_xml_characters() raised an exception if the
supplied data was not a string, which caused problems with the
table.html() method (now fixed).
- Bug fix: table.jloggraph() method did not include the $TABLE..
and other tags when generating the HTML code, so graphs would
not display correctly (now fixed).
- Significant update to docstrings to provide more extensive
documentation in smartie.html.
- strip_logfile_html() updated to extract CCP4 formatted tables
from <param...> used for JLogGraph displays.
- Removed report_example() function.
- Added new "summary" class, which describes the location of a
block of summary text within a logfile that is marked up with
CCP4 summary tags. logfile objects can store summary objects
and parselog() will add them in the order that they are located
in the file.
summary objects are not yet associated with fragments.
- New function copyfragment() allows one fragment or fragment
subclass to be populated from another.
- logfile class: new method fragment_to_program() will convert
an existing fragment object stored in the logfile object to a
program object, at the same time updating the program lists
appropriately.
- program class: added new methods addlogicalname(),
logicalnames() and logicalnamefile() that allow the storage and
retrieval of the logical name/file name pairs found in the
program logfile.
The parselog() function now recognises the file opening reports
in the logfile and adds the data to the program object
automatically.
Changes in 0.0.7
- parselog() recognises keyworded input lines (i.e. logfile lines
starting with a " Data line---" preamble) and stores the lines
themselves in the appropriate program objects. The list of keyword
lines can be retrieved using the program.keywords() method.
Changes in 0.0.6
- Table values are now stored as floats or integers as appropriate.
Items that cannot be converted to numerical values are stored as
strings, as before.
- table object now has an nrows() method that returns the number of
rows of data in the table. The table_column object also has a new
nrows() method that returns the number of rows in a single column.
- logfile and fragment classes have a new tables() method, which by
default returns the list of tables for that object. It can also
return a subset of tables that match a regular expression for
their titles, and thus supersedes the existing findtable()
methods in those classes. The tables() method also supersedes
the existing table() method, which is now deprecated.
- Bug fix: the findtable() method in the logfile and fragment classes didn't
work correctly; this is now fixed (see the updated documentation).
However, findtable() is now deprecated in favour of the new tables()
methods in those classes.
- Bug fix: fixed fragment class so that pickle.dump() can be used
to serialise a smartie logfile object (previously caused a crash)
Changes in 0.0.5
- Bug fix: handle cases where the logfile being processed
doesn't appear to contain any recognisable features such as
program banners or tables (previously this caused a crash).
- Bug fix: handle log files with DOS-formatted line endings.
- Updated to deal with old-style (pre-5.0) CCP4 program banners
and termination messages (both 4.1 and 4.0).
- Updated to deal with Phaser 1.3.3 program banner, which
appears to differ from the 1.3.2 version.
- Table processing improved to deal with tables that don't
conform to the loggraph $TABLE specification.
Changes in 0.0.4
- table class: the findtable() method has an
optional second argument that specifies where in the list to
start the search from - see module documentation for more
detailed information.
- table class: there is a new html() method
that generates an HTML table with the column titles and
data.
Changes in 0.0.3
- Bug fix: parselog now recognises CCP4 program names that
contain bracket characters (previously failed to recognise
names like "MOLREP(ccp4)9.0.08"). Added new example file
3_molrep.log which originally exposed this
problem.
- Bug fix: parselog code was broken for handling incomplete
program log files where the start banner is missing but the
termination message is present; this is now fixed. Added new
example files scala_nobanner.log and
molrep_nobanner.log, which demonstrate this
situation.
- Added findtable() methods to logfile,
fragment and program classes. This allows
a table to be retrieved by specifying a regular expression
which matches the table title.
- The behaviour of table.show() has changed; by
default it now returns only the table body (column titles
plus data) without any additional markup. The behaviour in
the previous version can be obtained either by specifying
the loggraph argument as True, or by
using the new table.loggraph() method - both of
which return the table data formatted as a loggraph table.
Changes in 0.0.2
- The logfile class now has a list of fragment objects, and
the program object list is used exclusively for referring
to logfile fragments that are from programs. There are new
methods logfile.nfragments() and
logfile.fragment(i) to access the fragment list,
and the summarise function also includes summaries
of the fragments.
- The method for accessing attributes of a program object has
changed, so now instead of e.g. program.name() you
must use either program.name (i.e. no parentheses), or
program["name"], or program.get_attribute("name")
(the first two methods simply wrap the last one).
The change in implementation means that there are no longer
explicit methods defined in the program class to tell you
what the available attributes are; instead use the
program.attributes() method to get a list, and see
the documentation in smartie.html
for descriptions of each attribute.
- There is now a way to access the data in a table column
by referencing its name, using the new col() method
of the table class - for example
table.col("Rfree") returns a list of the values
in the "Rfree" column of the table, and
table.col("Rfree")[-1] references the last value.
- Added support for retrieving the text for a fragment
directly from the logfile - the fragment,
program and ccp4i_info classes all
support the retrieve method to allow the text to
be fetched from the file and returned as a string.
- Fixed a bug in the jloggraph method of the table
class (only the last line was ever returned).
See also
The CCP4 documentation contains details of the format and syntax for CCP4
tables, graphs and keytext messages.
Acknowledgements
Thanks to Ronan Keegan, Wendy Yang and Martyn Winn for providing
useful input to the development of smartie.
Kevin Cowtan has also provided code changes and useful feedback.
Author
Peter Briggs, 2006-8