smartie.py: CCP4 logfile parser tools

Usage documentation for version 0.0.15

Introduction

smartie is a set of Python classes and methods intended to provide tools for parsing the content of CCP4 logfiles. The name "smartie" reflects its origins as the driver for a "smart logfile browser", although this aim has not yet been realised.

The logfile class lies at the heart of smartie. Once populated from a file, a logfile object gives a high-level view of that file in terms of "components" (CCP4i comments, individual program logs, tables, warnings, summaries and so on). The logfile class was loosely inspired by the Javascript DOM (document object model) for describing hypertext documents, although it is far more limited at present.

To see an example of smartie parsing a logfile, feed it one of the example logfiles included in the distribution, e.g.:

% python smartie.py 7_refmac5.log

or all the example log files:

% python smartie.py *.log | more

or try it on a file of your own.

Module documentation

The documentation for the classes and functions in the smartie module (generated using pydoc) is here:

smartie.html

There are also overviews of the different classes below.

Usage examples

1. Interrogating logfiles

To create a new logfile object describing (say) a CCP4i logfile from a scala job, we use smartie's parselog function:

>>> import smartie
>>> logfile = smartie.parselog("22_scala.log")

We can then find out for example how many "fragments" smartie thought it had found in this file:

>>> logfile.nfragments()
8

A fragment is any chunk of the logfile that smartie recognised. We can interrogate the type of a fragment, for example:

>>> logfile.fragment(0).isccp4i_info()
True

If we're only interested in the fragments that looked like individual program output then we can also find out how many programs it thought it had found, by querying its list of program logs:

>>> logfile.nprograms()
7

We can ask it some questions about individual programs, for example:

>>> logfile.program(1).isccp4()
True
>>> logfile.program(1).name
'Scala'
>>> logfile.program(1).version
'5.99'
>>> logfile.program(1).termination_message
'** Normal termination **'

It is also possible to get a list of the keyword input lines for each program logfile, for example:

>>> logfile.program(0).name
'SORTMTZ'
>>> logfile.program(0).keywords()
['ASCEND', 'H K L M/ISYM BATCH']

For each program any "logical name/filename" pairs found in the logfile are also stored and can be retrieved. To get a list of the logical names associated with files that were opened:

>>> logfile.program(1).logicalnames()
['HKLIN', 'HKLOUT', 'SYMINFO']

Then, to find out what the associated file is for a particular logical name:

>>> logfile.program(1).logicalnamefile("HKLIN")
'/home/pjx/PROJECTS/myProject/aucn_sorted.mtz'
>>> logfile.program(1).logicalnamefile("HKLOUT")
'/tmp/pjx/PROJECT_22_2_mtz.tmp'
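
The queries above can be combined into a small report. The following is a minimal sketch rather than part of smartie: describe_programs is a hypothetical helper name, and it relies only on the methods and attributes shown in the examples above (nprograms(), program(i), name, version, logicalnames() and logicalnamefile()), assuming they behave as illustrated.

```python
def describe_programs(logfile):
    """Return a list of text lines describing each program log in 'logfile'.

    'describe_programs' is a hypothetical helper, not part of smartie; it
    uses only the logfile and program methods shown in the examples above.
    """
    lines = []
    for i in range(logfile.nprograms()):
        prog = logfile.program(i)
        lines.append("%d: %s %s" % (i, prog.name, prog.version))
        # List each logical name with the file that was attached to it
        for logical_name in prog.logicalnames():
            lines.append("    %s = %s" % (logical_name,
                                          prog.logicalnamefile(logical_name)))
    return lines
```

The returned lines could then be printed for a logfile object obtained from smartie.parselog(), for instance with "\n".join(describe_programs(logfile)).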

The logfile class offers similar methods for fragments which are not CCP4 program output but are (for example) messages from CCP4i. Smartie also has a summarise function which will print a report of the logfile contents:

>>> logfile = smartie.parselog("22_scala.log")
>>> smartie.summarise(logfile)
Summary for 22_scala.log

This is a CCP4i logfile

8 logfile fragments

Fragments:
        CCP4i info
        Program: SORTMTZ
        Program: Scala
        Program: MTZDUMP
        Program: UNIQUE
        Program: FREERFLAG
        Program: CAD
        Program: FREERFLAG

7 program logfiles

Programs:
        SORTMTZ v5.99   (CCP4 5.99)
        Scala   v5.99   (CCP4 5.99)

                Tables:
                Table: ">>> Scales v rotation range, red_aucn"
                Table: "Analysis against Batch, red_aucn"
                Table: "Analysis against resolution , red_aucn"

...

The fragments and fragment-derived objects (programs and ccp4i_info) also allow the text of the fragment to be retrieved from the logfile. For example:

>>> logfile = smartie.parselog("22_scala.log")
>>> prog = logfile.program(0)
>>> print prog.retrieve()
 ###############################################################
 ###############################################################
 ###############################################################
 ### CCP4 5.99: SORTMTZ            version 5.99      : 06/09/05##
 ###############################################################
 User: pjx  Run date: 31/ 1/2006 Run time: 16:02:55

...

2. Extracting summaries from marked up logfiles

As of version 0.0.8, logfile objects store information about blocks of "summary" text that are found in the source logfile. A summary block is a section of logfile output that is enclosed within <!--SUMMARY_BEGIN--> and <!--SUMMARY_END--> tags, for example:

<B><FONT COLOR="#FF0000"><!--SUMMARY_BEGIN-->

================================================================================

Summary data for Project: DMSO Crystal: DMSO Dataset: red_aucn

                                           Overall  OuterShell

  Low resolution limit                       35.27      3.16
  High resolution limit                       3.00      3.00
...
<!--SUMMARY_END--></FONT></B>

There are a number of methods available to interrogate the summary block information. For example, to find out how many summary blocks a logfile contains:

>>> logfile = smartie.parselog("22_scala.log")
>>> logfile.nsummaries()
30

i.e. the logfile holds 30 summary blocks. For each summary block the start and end lines in the source logfile can be retrieved, as can the actual text. For example, to get information on the 13th summary block in a logfile:

>>> logfile.summary(12).start()
853
>>> logfile.summary(12).end()
855
>>> print logfile.summary(12).retrieve()
<B><FONT COLOR="#FF0000"><!--SUMMARY_BEGIN-->
Logical name: CORRELPLOT, Filename: /home/pjx/PROJECTS/myProject/PROJECT_22_correlplot.xmgr
<!--SUMMARY_END--></FONT></B>

It's not clear that this functionality is particularly useful. One application is to write out all the summaries in one go e.g.:

>>> for i in range(0,logfile.nsummaries()):
...    print logfile.summary(i).retrieve()
...

(In this last example, Smartie's strip_logfile_html() command could also be used to remove any HTML tags in the output - and to escape any HTML special characters - in order to make the summary output easier to read.)

This functionality is provided in the show_summary.py example script.
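
To make the idea of a summary block concrete, here is a stand-alone sketch that pulls the text between the summary tags out of a string using a regular expression. It is not how smartie locates summaries (smartie finds them while parsing and exposes them through logfile.summary(i)); the names SUMMARY_BLOCK and extract_summaries are hypothetical and exist only for this illustration.

```python
import re

# Matches the text between the CCP4 summary tags; DOTALL lets '.' span
# line breaks and '.*?' stops at the first closing tag
SUMMARY_BLOCK = re.compile(r"<!--SUMMARY_BEGIN-->(.*?)<!--SUMMARY_END-->",
                           re.DOTALL)


def extract_summaries(text):
    """Return the text of each summary block found in 'text', in order.

    Hypothetical stand-alone helper for illustration, not part of smartie.
    Leading and trailing whitespace is removed from each block.
    """
    return [block.strip() for block in SUMMARY_BLOCK.findall(text)]
```

Because the surrounding <B> and <FONT> markup lies outside the comment tags, it is not included in the extracted text; any HTML inside a block is returned unchanged.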

3. Working with tables and graphs

Once a logfile object has been constructed from a file, smartie offers various ways to find out about the tables associated with the file overall, and with individual programs and fragments.

We can ask it about tables that it found for an individual fragment or program:

>>> logfile.fragment(2).ntables()
7
>>> logfile.program(1).tables()[3].title()
'Analysis against intensity, red_aucn'
>>> logfile.program(1).tables()[4].ngraphs()
4
>>> logfile.program(1).tables()[4].table_graph(0).title()
'Completeness v Resolution '
>>> logfile.program(1).tables()[4].nrows()
10

We can also ask it similar questions about tables in the logfile as a whole, for example:

>>> logfile.ntables()
7

We can also fetch a table in a logfile or a program by specifying a regular expression pattern that matches the table title, for example:

>>> logfile = smartie.parselog("7_refmac5.log")
>>> logfile.tables("Rfactor analysis, stats vs cycle")[0].title()
'Rfactor analysis, stats vs cycle'
>>> logfile.program(0).tables("Cycle   11")[0].title()
'Cycle   11. Rfactor analysis, F distribution v resln'

For a particular table we can get the values for a particular column:

>>> tbl = logfile.tables()[6]
>>> tbl.col("Rfree")
['0.178', '0.196', '0.204', '0.210', '0.215', '0.221', '0.222', '0.225',
'0.227', '0.228', '0.229']
>>> tbl.col("Rfree")[-1]
'0.229'
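
The last two steps can be wrapped up in one call. This is a minimal sketch under stated assumptions, not part of smartie: final_value is a hypothetical helper name, it uses only the tables(pattern) and col(name) methods shown above, and it assumes that tables() returns an empty list when no title matches.

```python
def final_value(logfile, title_pattern, column_name):
    """Return the last entry of a named table column as a float, or None.

    Hypothetical helper, not part of smartie. It takes the first table
    whose title matches 'title_pattern' and assumes that tables() returns
    an empty list when nothing matches.
    """
    matches = logfile.tables(title_pattern)
    if not matches:
        return None
    values = matches[0].col(column_name)
    if not values:
        return None
    # float() accepts both string values (as shown above) and numbers
    return float(values[-1])
```

For the refmac5 example this could be called as final_value(logfile, "Rfactor analysis, stats vs cycle", "Rfree").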

4. Using the table class to create tables and graphs

smartie's table, table_graph and table_column classes are intended to be useful not only for reading tables from logfiles, but also for constructing and writing them.

In outline the steps involved are:

Create a new table e.g. tbl = smartie.table(title)
Define a set of columns e.g. tbl.addcolumn(column_name)
Add data to the table "row-wise" e.g. tbl.add_data(dictionary)
(Optionally) add graph definitions e.g. tbl.definegraph(title,column_list)

Graph definitions are required in order to generate a $TABLE marked-up loggraph table using the table.loggraph() and related methods.

An example of creating and populating a new table can be found in the smartie.table_example() method. Alternatively:

>>> tbl = smartie.table("A table with random data")
>>> for i in range(0,3):
...     col = tbl.addcolumn("col_"+str(i))
...
>>> for j in range(0,6):
...     tbl.add_data({"col_0":j,"col_1":j*2,"col_2":j*3})
...
>>> tbl.definegraph("An arbitrary graph",("col_0","col_1"))
>>> tbl.definegraph("Another arbitrary graph",("col_0","col_2"))
>>> print tbl.loggraph()
$TABLE: A table with random data:
$GRAPHS
 :An arbitrary graph:A:1,2:
 :Another arbitrary graph:A:1,3:
$$
  col_0  col_1  col_2 $$ $$
      0      0      0
      1      2      3
      2      4      6
      3      6      9
      4      8     12
      5     10     15
$$

If you want the graph to be written with the Jloggraph applet markup also included then you can use instead:

>>> print tbl.jloggraph()
<applet width="400" height="300" code="JLogGraph.class"
codebase=""><param name="table" value="
$TABLE: A table with random data:
$GRAPHS
 :An arbitrary graph:A:1,2:
 :Another arbitrary graph:A:1,3:
$$
  col_0  col_1  col_2 $$ $$
      0      0      0
      1      2      3
      2      4      6
      3      6      9
      4      8     12
      5     10     15
$$"><b>For inline graphs use a Java browser</b></applet>

Alternatively, the show method will just return the table body (column titles plus data) as a block of text without any additional markup, and the html method will return a similar table formatted with the appropriate HTML tags (and with any special characters converted to their HTML equivalents for correct display).
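
The $TABLE layout in the loggraph() output shown earlier in this section can be illustrated with a stand-alone sketch. This is not smartie's code and does not reproduce its column padding: loggraph_text is a hypothetical function, and the "A" field in each graph line is simply copied from the example output. The point it shows is that each graph definition refers to columns by their 1-based position rather than by name.

```python
def loggraph_text(title, columns, rows, graphs):
    """Format data in the $TABLE layout shown earlier in this section.

    Stand-alone illustration only (hypothetical function, not smartie code).
    columns: list of column names
    rows:    list of rows, each a list with one value per column
    graphs:  list of (graph_title, [column names]) pairs
    """
    lines = ["$TABLE: %s:" % title, "$GRAPHS"]
    for graph_title, graph_columns in graphs:
        # Graphs refer to columns by 1-based position, not by name
        numbers = [str(columns.index(name) + 1) for name in graph_columns]
        lines.append(" :%s:A:%s:" % (graph_title, ",".join(numbers)))
    lines.append("$$")
    # Column titles, followed by the two '$$' separators seen in the example
    lines.append(" ".join(columns) + " $$ $$")
    for row in rows:
        lines.append(" ".join(str(value) for value in row))
    lines.append("$$")
    return "\n".join(lines)
```

In practice table.loggraph() should be preferred; the sketch is only meant to make the markup easier to read.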

Overview of Smartie classes

Smartie offers the following principal classes:

the logfile class, which gives a high-level description of a logfile made up of smaller fragments (currently called "programs" within the context of smartie, although not all of them are actually program logfiles)
the fragment class, which describes a generic fragment of a logfile, and additional program and ccp4i_info classes that are derived from it, and which typically describe the log from a single CCP4 program and messages from CCP4i respectively.
the table class, which describes a CCP4 formatted table
the summary class, which describes the location of a block of summary text (enclosed within CCP4 summary tags) within a logfile

There are a number of additional classes to support these:

the table_graph and table_column classes provide a way to interact with the components of tables
the keytext class describes warnings from CCP4 programs that are typically found embedded in program logfiles.

Finally there are also some classes which are primarily intended for use internally to smartie:

the buffer and tablebuffer classes

A logfile object is populated and returned by the parselog() function. This takes a file name as a single compulsory argument; the optional "progress" argument specifies a number of lines at which to report progress when parsing the file. It can recognise the following features in a logfile:

CCP4 program banners (both standard and "phaser-style")
CCP4 program terminations (both standard and "phaser-style")
CCP4-formatted tables
CCP4-formatted warnings from CCPERR
CCP4i logfile "head" and "tail" (termination)
CCP4i information messages
CCP4 "summary" tags enclosing blocks of summary text

parselog reads a logfile and returns a logfile object based on the file contents. A logfile object holds lists of fragments, programs, tables, keytext messages, CCP4i information messages and summary blocks.
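
As a quick check on what parselog found, the counting methods used in the usage examples can be gathered into a dictionary. This is a minimal sketch, not part of smartie: count_features is a hypothetical helper name, and it calls only nfragments(), nprograms(), ntables() and nsummaries() as they appear earlier in this document.

```python
def count_features(logfile):
    """Return a dictionary of feature counts for a smartie logfile object.

    Hypothetical helper, not part of smartie; it calls only the counting
    methods used in the usage examples earlier in this document.
    """
    return {
        "fragments": logfile.nfragments(),
        "programs": logfile.nprograms(),
        "tables": logfile.ntables(),
        "summaries": logfile.nsummaries(),
    }
```

It would be called on the object returned by parselog, e.g. count_features(smartie.parselog("22_scala.log")).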

Applications using Smartie

Smartie is currently used in three applications:

The MrBUMP automated molecular replacement program uses Smartie's table extraction and manipulation functionality to help in processing output from some of the underlying programs
The "baubles" program uses smartie to analyse a CCP4 logfile before reformatting it in a mixture of HTML, CSS and Javascript for display in a web browser.
The "starKey" program developed as part of CCP4's contribution to the BIOXHIT project uses Smartie to gather information about which programs were run by a particular CCP4i task - see CCP4-BIOXHIT: Available Files

Known issues/to-do list

When constructing tables there is no way to validate them (e.g. checking that there are the correct number of rows in each column etc) before writing out
The API for the table and related classes needs a review and overhaul in the context of actual usage
The fragment class should have better methods to query what type of fragment (program, table etc).
Deal properly with HTML and SUMMARY markup of CCP4 logfiles - for example, by taking this into account when looking for start and end
Improvements to the CCP4i header, tail and messages - most likely make these into dedicated subclasses of the fragment class.
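
Until table validation is available, the first issue above can be worked around by checking the data before it is handed to smartie. The following is a sketch under stated assumptions rather than a smartie feature: check_columns and rows_from_columns are hypothetical helpers, and the data is assumed to be held as a dictionary mapping each column name to a list of values.

```python
def check_columns(columns):
    """Check that every column holds the same number of values.

    Hypothetical helper, not part of smartie. 'columns' maps each column
    name to its list of values. Returns the common row count, or raises
    ValueError listing the column lengths if they differ.
    """
    lengths = dict((name, len(values)) for name, values in columns.items())
    if len(set(lengths.values())) > 1:
        raise ValueError("Unequal column lengths: %r" % sorted(lengths.items()))
    # An empty dictionary has no columns and therefore no rows
    return next(iter(lengths.values()), 0)


def rows_from_columns(columns):
    """Turn validated column lists into one dictionary per row."""
    nrows = check_columns(columns)
    return [dict((name, columns[name][i]) for name in columns)
            for i in range(nrows)]
```

Each dictionary returned by rows_from_columns has the form accepted by tbl.add_data() in the table examples above, once the corresponding columns have been defined with tbl.addcolumn().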

Change log

Changes in 0.0.15

buffer class now supports len(buffer) for retrieving number of lines stored.
Table parsing errors: parselog reports filename and location of tables that cannot be properly parsed, and warning from table.__populate_columns() has been updated to be less cryptic.

Changes in 0.0.14

table.show() has new optional argument pad_columns, which controls whether data items in table columns are padded with spaces in order to align them. table.loggraph() also supports this argument.
strip_logfile_html tries to salvage spacegroup names that have been written in the form <P 41>, using new function tag_is_spacegroup.
buffer class has a new all() method: this returns the whole buffer as a string, so can be used instead of the tail method for buffer objects which do not have a fixed size.
parselog now creates table objects for any table-like feature that is found, even if the table cannot subsequently be parsed. The raw text of a table is always available using the table.rawtext() method, even if the table is not parsed correctly, and diagnostic warnings to stdout have been reduced to a single line.
parselog: the limit of 1000 lines has been removed from the tablebuffer, so arbitrarily large tables can now be processed (within system limits).

Changes in 0.0.13

Bug fix: fix len(fragment) function so that it always returns zero or greater.

Changes in 0.0.12

Bug fix: patternmatch.isccp4banner() has been updated to recognise CCP4 version numbers that include a trailing lower-case letter e.g. 6.0.99a. Previously programs with these version numbers did not have their banners identified by smartie.
The number of lines in the logfile that belong to a fragment (or program) object can be obtained by using the len function, e.g. len(fragment).

Changes in 0.0.11

logfile class: the "source" log file name is now stored as an absolute path.
Updates to the smartie directory structure: example logfiles are now in the "logfiles" subdirectory.
"test" subdirectory contains a basic set of unit tests for some of the pattern matching functions in smartie - it can be run using python test/test_smartie.py.

Changes in 0.0.10

show_summary.py: example script that uses smartie to process a logfile and then prints the text enclosed in summary tags.
$TABLE recognition failed for some examples where the table title contains ":" characters before the end of the title string. In this case smartie would complain that it couldn't process the table - now it applies a second matching step with fewer constraints before giving up.
table.jloggraph(): now escapes special HTML characters in table and graph titles when generating applet code, including double quotes " - otherwise these can cause problems with JLogGraph's parsing of the result table.
escape_xml_characters(): now also escapes double quotes and replaces them with &quot;. Updated to use "replace" rather than regular expressions (Kevin Cowtan).
parselog: buffer size increased to 50 lines, which means that features of up to 50 lines can be recognised (used to be 10) (Kevin Cowtan).
patternmatch class: methods for detecting program banners and termination messages now have an additional fast test which skips the full regular expression tests if not passed - this has resulted in a significant speed-up (Kevin Cowtan).
retrieve functions: this has been modified and should be faster now for big files. Previously it was slow due to sub-optimal usage of the linecache module.
table.show(): always printed tables as $GRAPHS, even if they were originally $SCATTER plots - now fixed.
Bug fix: table_column.append() always converted numerical values to integers, regardless of whether they were actually integers or floats. Now floats should be properly treated.
table_graph.graphcols(): new method that returns a list of the column names that make up the graph.
table class: new methods to make it easier to build tables from scratch, specifically:
table.list_columns(): return a list of the column names defined in the table
table.add_data(): add a "row" of data to the table
table.definegraph(): add a new graph definition to the table based on the existing columns.
The table_example() function has also been updated to show how these new methods can be used to make a new table.

Changes in 0.0.9

tokenise() function: now recognises either single or double quotes as token delimiters (and quotes of one type can contain quotes of the other type).

Changes in 0.0.8

Minor modification to parselog, to set the start line for "unknown" program fragments to be immediately after the end of the previous fragment (relevant if you are using the retrieve() methods of programs and fragments).
Bug fix: escape_xml_characters() raised an exception if the supplied data was not a string, which caused problems with the table.html() method (now fixed).
Bug fix: table.jloggraph() method did not include the $TABLE.. and other tags when generating the HTML code, so graphs would not display correctly (now fixed).
Significant update to docstrings to provide more extensive documentation in smartie.html.
strip_logfile_html() updated to extract CCP4 formatted tables from <param...> used for JLogGraph displays.
Removed report_example() function.
Added new "summary" class, which describes the location of a block of summary text within a logfile that is marked up with CCP4 summary tags. logfile objects can store summary objects and parselog() will add them in the order that they are located in the file.
summary objects are not yet associated with fragments.
New function copyfragment() allows one fragment or fragment subclass to be populated from another.
logfile class: new method fragment_to_program() will convert an existing fragment object stored in the logfile object to a program object, at the same time updating the program lists appropriately.
program class: added new methods addlogicalname(), logicalnames() and logicalnamefile() that allow the storage and retrieval of the logical name/file name pairs found in the program logfile.
The parselog() function now recognises the file opening reports in the logfile and adds the data to the program object automatically.

Changes in 0.0.7

parselog() recognises keyworded input lines (i.e. logfile lines starting with a " Data line---" preamble) and stores the lines themselves in the appropriate program objects. The list of keyword lines can be retrieved using the program.keywords() method.

Changes in 0.0.6

Table values are now stored as floats or integers as appropriate. Items that cannot be converted to numerical values are stored as strings, as before.
table object now has an nrows() method that returns the number of rows of data in the table. The table_column object also has a new nrows() method that returns the number of rows in a single column.
logfile and fragment classes have a new tables() method, which by default returns the list of tables for that object. It can also return a subset of tables that match a regular expression for their titles, and thus supersedes the existing findtable() methods in those classes. The tables() method also supersedes the existing table() method, which is now deprecated.
Bug fix: findtable() method in logfile and fragment classes didn't work correctly; this is now fixed - see updated documentation. However, findtable() is now deprecated in favour of the new tables() methods in those classes.
Bug fix: fixed fragment class so that pickle.dump() can be used to serialise a smartie logfile object (previously caused a crash)

Changes in 0.0.5

Bug fix: handle cases where the logfile being processed doesn't appear to contain any recognisable features such as program banners or tables (previously this caused a crash).
Bug fix: handle log files with DOS-formatted line endings.
Updated to deal with old-style (pre-5.0) CCP4 program banners and termination messages (both 4.1 and 4.0).
Updated to deal with Phaser 1.3.3 program banner, which appears to differ from the 1.3.2 version.
Table processing improved to deal with tables that don't conform to the loggraph $TABLE specification.

Changes in 0.0.4

table class: the findtable() method has an optional second argument that specifies where in the list to start the search from - see module documentation for more detailed information.
table class: there is a new html() method that generates an HTML table with the column titles and data.

Changes in 0.0.3

Bug fix: parselog now recognises CCP4 program names that contain bracket characters (previously failed to recognise names like "MOLREP(ccp4)9.0.08"). Added new example file 3_molrep.log which originally exposed this problem.
Bug fix: parselog code was broken for handling incomplete program log files where the start banner is missing but the termination message is present; this is now fixed. Added new example files scala_nobanner.log and molrep_nobanner.log, which demonstrate this situation.
Added findtable() methods to logfile, fragment and program classes. This allows a table to be retrieved by specifying a regular expression which matches the table title.
The behaviour of the table.show() method has changed; by default it now returns only the table body (column titles plus data) without any additional markup. The behaviour in the previous version can be obtained either by specifying the loggraph argument as True, or by using the new table.loggraph() method - both of which return the table data formatted as a loggraph table.

Changes in 0.0.2

The logfile class now has a list of fragment objects, and the program object list is used exclusively for referring to logfile fragments that are from programs. There are new methods logfile.nfragments() and logfile.fragment(i) to access the fragment list, and the summarise function also includes summaries of the fragments.
The method for accessing attributes of a program object has changed, so now instead of e.g. program.name() you must use either program.name (i.e. no parentheses), or program["name"], or program.get_attribute("name") (the first two methods simply wrap the last one).
The change in implementation means that there are no longer explicit methods defined in the program class to tell you what the available attributes are; instead use the program.attributes() method to get a list, and see the documentation in smartie.html for descriptions of each attribute.
There is now a way to access the data in a table column by referencing its name, using the new col() method of the table class - for example table.col("Rfree") returns a list of the values in the "Rfree" column of the table, and table.col("Rfree")[-1] references the last value.
Added support for retrieving the text for a fragment directly from the logfile - the fragment, program and ccp4i_info classes all support the retrieve method to allow the text to be fetched from the file and returned as a string.
Fixed a bug in the jloggraph method of the table class (only the last line was ever returned).
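
The three equivalent ways of reading a program attribute described above (program.name, program["name"] and program.get_attribute("name")) can be illustrated with a small stand-in class. This is a sketch of the general Python technique only and is not smartie's program class: AttributeStore and its set_attribute() method are hypothetical names used for the illustration.

```python
class AttributeStore(object):
    """Illustrative stand-in (not smartie's program class).

    Attribute-style and item-style access both wrap get_attribute().
    """

    def __init__(self):
        self._attributes = {}

    def set_attribute(self, name, value):
        # Hypothetical setter, used here only to populate the example
        self._attributes[name] = value

    def get_attribute(self, name):
        try:
            return self._attributes[name]
        except KeyError:
            raise AttributeError("Unknown attribute: %s" % name)

    def attributes(self):
        """Return a sorted list of the attribute names that are defined."""
        return sorted(self._attributes)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails; names starting
        # with an underscore are never looked up in the store
        if name.startswith("_"):
            raise AttributeError(name)
        return self.get_attribute(name)

    def __getitem__(self, name):
        return self.get_attribute(name)
```

With this arrangement, store.name, store["name"] and store.get_attribute("name") all return the same value, and an unknown name raises AttributeError in each case.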

Acknowledgements

Thanks to Ronan Keegan, Wendy Yang and Martyn Winn for providing useful input to the development of smartie.

Kevin Cowtan has also provided code changes and useful feedback.

Author

Peter Briggs, 2006-8