.\" Automatically generated by Pod::Man 2.22 (Pod::Simple 3.13)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. nr % 0
. rr F
.\}
.el \{\
. de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
. \" fudge factors for nroff and troff
.if n \{\
. ds #H 0
. ds #V .8m
. ds #F .3m
. ds #[ \f1
. ds #] \fP
.\}
.if t \{\
. ds #H ((1u-(\\\\n(.fu%2u))*.13m)
. ds #V .6m
. ds #F 0
. ds #[ \&
. ds #] \&
.\}
. \" simple accents for nroff and troff
.if n \{\
. ds ' \&
. ds ` \&
. ds ^ \&
. ds , \&
. ds ~ ~
. ds /
.\}
.if t \{\
. ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
. ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
. ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
. ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
. ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
. ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
. \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
. \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
. \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
. ds : e
. ds 8 ss
. ds o a
. ds d- d\h'-1'\(ga
. ds D- D\h'-1'\(hy
. ds th \o'bp'
. ds Th \o'LP'
. ds ae ae
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "PMP::PMP 3"
.TH PMP::PMP 3 "2009-02-28" "perl v5.10.1" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
PMP \- Poor Man's Pipeline; programmatic pipeline control
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
The best synopsis is found in the tutorial:
http://wiki.bic.mni.mcgill.ca/index.php/PoorMansPipeline
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
\&\s-1PMP\s0 stands for \*(L"Poor Man's Pipeline\*(R" and is a Perl library that
provides control over arbitrarily complex commands linked through
dependencies. The main goals of \s-1PMP\s0 are:
.IP "\(bu" 4
Execution of a set of commands describing a pipeline.
.IP "\(bu" 4
Tracking of dependencies between the different commands.
.IP "\(bu" 4
Parallel execution mode by using one of two batch queueing systems.
.IP "\(bu" 4
Drop-in replacement of parallel or sequential modes.
.IP "\(bu" 4
Generation of dependency graphs for easier debugging.
.IP "\(bu" 4
Full programmatic control over the pipeline. That is, it is designed as a
series of Perl classes rather than a separate language. The key advantage of
the approach that \s-1PMP\s0 takes is that it makes it possible to write
generic pipelines, since argument parsing and all control structures of Perl
are available to the user.
.IP "\(bu" 4
Easily customizable through the use of inheritance. Whether a pipeline uses a
batch queueing system or not can be changed with one line of code.
.PP
The main feature currently not present, which might be added in the near
future, is:
.IP "\(bu" 4
Use of a database to track dependencies and pipeline status. Using a database
rather than the filesystem is a blessing in that it can allow for faster
execution times since there is much less file access, and a curse in that it
makes an application much less portable.
.SH "COMPONENTS"
.IX Header "COMPONENTS"
\&\s-1PMP\s0 currently consists of five different classes:
.IP "\(bu" 4
\&\s-1PMP::PMP\s0
.Sp
The main class which is used to configure a pipeline. A pipeline is, for the
purposes of \s-1PMP\s0, defined as the set of commands and their dependencies
for a single subject.
.IP "\(bu" 4
PMP::spawn
.Sp
A subclass of \s-1PMP\s0 in which the command execution uses the MNI::Spawn
batch system. It should otherwise be entirely interchangeable with
\s-1PMP::PMP\s0.
.IP "\(bu" 4
PMP::pbs
.Sp
A subclass of \s-1PMP\s0 in which the command execution uses the \s-1PBS\s0
batch queueing system rather than the MNI::Spawn interface. It should
otherwise be entirely interchangeable with \s-1PMP::PMP\s0.
.IP "\(bu" 4
PMP::sge
.Sp
A subclass of \s-1PMP\s0 in which the command execution uses the \s-1SGE\s0
(Sun Grid Engine) batch queueing system. It should otherwise be entirely
interchangeable with \s-1PMP::PMP\s0.
.IP "\(bu" 4
PMP::Array
.Sp
Designed to deal with a set of pipelines. Most pipeline runs will consist of
multiple subjects executing the same set of commands; PMP::Array is designed
to make that easy.
.SH "OVERVIEW"
.IX Header "OVERVIEW"
The usual way of setting up a \s-1PMP\s0 pipeline is the following:
.PP
Import the necessary components through the use statement, e.g.:
.PP
.Vb 5
\& use PMP::PMP;
\& use PMP::spawn;
\& use PMP::pbs;
\& use PMP::sge;
\& use PMP::Array;
.Ve
.PP
The PMP::Array object is also declared at this early point:
.PP
.Vb 1
\& my $pipes = PMP::Array\->new();
.Ve
.PP
Then comes any argument processing that your application might have to deal
with, as well as setting up some global variables that will remain unchanged
for each pipeline. This is followed by the definitions of each individual
pipeline, usually placed inside a foreach loop which processes each subject.
Inside this loop the pipeline is initialised with one of the following:
.PP
.Vb 4
\& my $pipeline = PMP::PMP\->new(); # sequential version (default spawn)
\& my $pipeline = PMP::pbs\->new(); # parallel version using PBS
\& my $pipeline = PMP::sge\->new(); # parallel version using SGE
\& my $pipeline = PMP::spawn\->new(); # sequential version using MNI::Spawn
.Ve
.PP
Then certain globals for that pipeline are set, such as
.PP
.Vb 2
\& $pipeline\->name("some\-name");
\& $pipeline\->statusDir("/some/directory");
.Ve
.PP
This is also a good place to define variables that change for each subject,
such as input and output filenames.
.PP
This is followed by defining all the stages through the addStage method, an
example of which is:
.PP
.Vb 6
\& $pipeline\->addStage(
\& { name => "total",
\& label => "this does something interesting",
\& inputs => [$filename],
\& outputs => [$talTransform],
\& args => ["mritotal", $filename, $talTransform] });
.Ve
.PP
This same stage can also be written more concisely:
.PP
.Vb 3
\& $pipeline\->addStage(
\& { name => "total",
\& args => ["mritotal", "in:$filename", "out:$talTransform"] });
.Ve
.PP
After all the stages have been defined some further initialisation commands
can be run:
.PP
.Vb 6
\& # compute the dependencies based on the filenames:
\& $pipeline\->computeDependenciesFromInputs();
\& # update the status of all stages based on previous pipeline runs
\& $pipeline\->updateStatus();
\& # restart all stages that failed in a previous run
\& $pipeline\->resetFailures();
.Ve
.PP
Then the pipeline can be added to the PMP::Array object:
.PP
.Vb 1
\& $pipes\->addPipe($pipeline);
.Ve
.PP
The foreach loop can then be closed and the pipelines run:
.PP
.Vb 2
\& # loop until all pipes are done
\& $pipes\->run();
.Ve
.SH "PUBLIC METHODS"
.IX Header "PUBLIC METHODS"
.SS "new"
.IX Subsection "new"
Initialises a pipeline. Has to be the first method called. Takes no arguments.
.SS "addStage"
.IX Subsection "addStage"
Adds a stage definition to the pipeline.
Takes a hash reference as an argument. The hash has the following components:
.IP "\(bu" 4
name
.Sp
The name of that particular stage. The name is what will be used to address
this stage later on (such as for dependency tracking).
.IP "\(bu" 4
label
.Sp
A description of this stage. Entirely optional, and only used when generating
dependency graphs. Some formatting codes are allowed, especially for newlines:
use \e\en.
.IP "\(bu" 4
inputs
.Sp
An array of the input filenames. Input files can be specified explicitly in
this array or within the args statement (see below). Inputs and outputs can be
used to define relationships between stages.
.IP "\(bu" 4
outputs
.Sp
An array of output filenames. Output files can be specified explicitly in this
array or within the args statement (see below).
.IP "\(bu" 4
sge_opts
.Sp
A string which is passed directly to qsub when using the \s-1SGE\s0 execution
mode (and is ignored otherwise). For example, the string \*(L"\-l vf=2G\*(R"
would reserve 2 gigabytes of memory.
.IP "\(bu" 4
args
.Sp
An array containing the actual command that will be run when this stage is
executed. The first element is the program name, followed by the options and
filenames in the order that the program needs them. If an option is prefixed
with either in: or out: (e.g. \*(L"in:$filename\*(R") it is considered to be an
input to or output from this stage.
.IP "\(bu" 4
prereqs
.Sp
An optional array of stage names upon which this current stage depends.
Dependencies can also be computed based on relationships between the inputs
and outputs of different pipeline stages. In that case only stages which would
not be included through that mechanism should be added manually to the prereqs
array.
.IP "\(bu" 4
shellquote
.Sp
An optional boolean variable (0 or 1) which specifies whether shell-quoting
should be used in this stage. Only makes a difference for PMP::pbs and
PMP::sge at this moment.
By default shell-quoting is turned off; this flag has to be set for each stage
which should use shell-quoting.
.PP
An example of adding a stage would be:
.PP
.Vb 5
\& $pipeline\->addStage(
\& { name => "cls",
\& label => "does something else that is interesting",
\& args => ["classify_clean", "\-clobber", "\-clean_tags",
\& "in:$final", "out:$cls"] });
.Ve
.SS "statusDir"
.IX Subsection "statusDir"
Gets or sets the directory in which status files are placed. Status files are
used to keep track of each stage's completion status as well as whatever
messages the running of that stage produced. The following files can thus be
created for each stage during the processing of a pipeline:
.IP "\(bu" 4
statusDir/pipelineName.stageName.running
.Sp
An empty file that is created while the stage is running or has been submitted
to the batch system. This file is removed once the stage completes or crashes.
.IP "\(bu" 4
statusDir/pipelineName.stageName.finished
.Sp
An empty file that is created when a stage has completed successfully.
.IP "\(bu" 4
statusDir/pipelineName.stageName.failed
.Sp
An empty file that is created when a stage has exited with any value other
than zero.
.IP "\(bu" 4
statusDir/pipelineName.stageName.log
.Sp
A file that is created once a stage has finished and which holds the messages
printed to stdout and stderr during the execution of a job.
.SS "name"
.IX Subsection "name"
Gets or sets the name of the pipeline (if an argument is supplied then it sets
the name to that argument).
.SS "debug"
.IX Subsection "debug"
Gets or sets whether debug messages will be printed. A value of 0 turns
debugging off, anything else turns it on.
.SS "printUnfinished"
.IX Subsection "printUnfinished"
Prints the unfinished stages. If no arguments are supplied it prints them
tersely; if an argument is supplied it gives more detail about each stage that
is still unfinished.
.SS "computeDependenciesFromInputs"
.IX Subsection "computeDependenciesFromInputs"
Uses the input and output files of all stages to compute the dependencies
between stages. Should be called after all stages have been added and before
the pipeline is executed.
.SS "statusFromFiles"
.IX Subsection "statusFromFiles"
Sets the status of each stage based on its inputs and outputs (as specified in
addStage). A stage will be considered to have finished if both the outputs and
inputs exist and if the outputs are newer than the inputs.
.SS "updateStatus"
.IX Subsection "updateStatus"
Updates the status of each stage based on the status files. Should be called
after all the stages have been added and before the pipeline is executed.
.SS "registerPrograms"
.IX Subsection "registerPrograms"
Registers all the programs used in the pipeline. The assumption is that the
first element of the args array that is passed to addStage contains the
program name. A benefit of registering the programs is that \s-1PMP\s0 will
die if any of the programs cannot be found in the environment.
.SS "run"
.IX Subsection "run"
Runs one iteration of the pipeline. Returns a value of 0 when the pipeline has
no more stages that can be executed.
.SS "resetStage"
.IX Subsection "resetStage"
Takes a stage name as an argument and resets that stage's status so that it
becomes runnable again.
.SS "resetFailures"
.IX Subsection "resetFailures"
Resets all stages that have failed so that they can be run again.
.SS "resetFromStage"
.IX Subsection "resetFromStage"
Takes a stage name as an argument and resets all stages from that stage
onwards (including that stage itself).
.SS "resetAfterStage"
.IX Subsection "resetAfterStage"
Takes a stage name as an argument and resets all stages after that stage
(excluding that stage itself).
.SS "resetAll"
.IX Subsection "resetAll"
Resets all stages in the pipeline.
.SS "resetRunning"
.IX Subsection "resetRunning"
Resets all stages thought to be running.
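.PP
To illustrate the idea behind computeDependenciesFromInputs described earlier
in this section, the following self-contained sketch derives prerequisites by
matching the inputs of each stage against the outputs of the other stages. It
is not taken from the \s-1PMP\s0 sources: the stage table, the file names and
the helper deps_from_files are assumptions made for this example only.
.PP
.Vb 33
\& use strict;
\& use warnings;
\&
\& # Hypothetical stage table: name => { inputs => [...], outputs => [...] }
\& my %stages = (
\&     total => { inputs => ["t1.mnc"], outputs => ["tal.xfm"] },
\&     final => { inputs => ["t1.mnc", "tal.xfm"], outputs => ["final.mnc"] },
\&     cls   => { inputs => ["final.mnc"], outputs => ["cls.mnc"] },
\& );
\&
\& # Map every output file to the stage that produces it, then look up
\& # each input file to find the stage (if any) that it depends on.
\& sub deps_from_files {
\&     my ($stages) = @_;
\&     my %producer;
\&     for my $name (keys %$stages) {
\&         $producer{$_} = $name for @{ $stages\->{$name}{outputs} };
\&     }
\&     my %prereqs;
\&     for my $name (keys %$stages) {
\&         my %seen;
\&         for my $file (@{ $stages\->{$name}{inputs} }) {
\&             my $from = $producer{$file};
\&             next unless defined $from and $from ne $name;
\&             $seen{$from} = 1;
\&         }
\&         $prereqs{$name} = [ sort keys %seen ];
\&     }
\&     return \e%prereqs;
\& }
\&
\& my $prereqs = deps_from_files(\e%stages);
\& print "$_: @{ $prereqs\->{$_} }\en" for sort keys %$prereqs;
.Ve
.PP
Files that no stage produces, such as the original input volume in this
sketch, simply contribute no prerequisite.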
.SS "createDotGraph"
.IX Subsection "createDotGraph"
Takes a filename as an argument; a graph description will be written to that
file. One can use dot (a tool that is part of graphviz) to generate a
graphical representation of the dependencies like so: dot \&\-Tps filename \-o
output.ps.
.SS "createFilenameDotGraph"
.IX Subsection "createFilenameDotGraph"
Takes a filename as an argument as well as an optional third argument
representing a substring to be removed from the filenames. It then creates a
dot file for generating a graph of the filename dependencies.
.SS "printStatusReportHeader"
.IX Subsection "printStatusReportHeader"
Takes a filehandle reference as an argument, and prints a header in \s-1CSV\s0
format containing all the stage names to that filehandle.
.SS "printStatusReport"
.IX Subsection "printStatusReport"
Takes a filehandle reference as an argument, and prints the status for each
stage in \s-1CSV\s0 format to that filehandle.
.SS "printStage"
.IX Subsection "printStage"
Takes a stage name as an argument and prints information about that stage.
.SS "printStages"
.IX Subsection "printStages"
Prints all stages in the pipeline.
.SS "getPipelineStatus"
.IX Subsection "getPipelineStatus"
Gets the pipeline's status, returning one of four possible strings:
.IP "\(bu" 4
\&\*(L"not started\*(R"
.Sp
This pipeline has not yet been started.
.IP "\(bu" 4
running
.Sp
This pipeline is running; also returns a list of the stages that are currently
running.
.IP "\(bu" 4
failed
.Sp
This pipeline has failed; also returns a list of the stages that have failed.
.IP "\(bu" 4
finished
.Sp
This pipeline has finished.
.SS "subsetToStage"
.IX Subsection "subsetToStage"
Takes a stage name as an argument, and creates a subset of stages running from
the beginning of the pipeline up to that stage.
.SH "SEMI-PRIVATE METHODS"
.IX Header "SEMI-PRIVATE METHODS"
In the good old Perl tradition \s-1PMP\s0 has no private methods.
The methods listed here, however, are not really meant for the calling
program. Most should not do any harm, but there is no guarantee. In other
words, use at your own risk.
.SS "stageStatusFromFiles"
.IX Subsection "stageStatusFromFiles"
Takes a stage as an argument and sets the status of that stage to finished if
all of its inputs and outputs exist and the outputs are newer than the inputs.
.SS "printDependencyTree"
.IX Subsection "printDependencyTree"
Prints the dependency tree. Sort of. The issue is that the dependency runs
both downwards and rightwards. In other words, a stage that appears in this
tree is guaranteed not to depend on any stages to its right or below it. A bit
hard to read, which is why this is still considered a semi-private method.
.SS "sortStages"
.IX Subsection "sortStages"
Sorts the stages based on their dependencies. Gets called automatically when
needed, so has no real place in user space. The order only guarantees that a
stage does not depend on any of the following stages.
.SS "isStageFinished"
.IX Subsection "isStageFinished"
Takes a stage name as an argument and returns true if the stage has finished.
In \s-1PMP\s0 it checks first whether the status flag has been set to
finished, and if not whether the finished file exists for that stage in the
statusDir. Would have to be overridden in a subclass that uses a database to
track the pipeline's status.
.SS "isStageRunning"
.IX Subsection "isStageRunning"
Same as above but checks whether the stage is running.
.SS "isStageFailed"
.IX Subsection "isStageFailed"
Same as above but checks whether the stage has failed.
.SS "updateStageStatus"
.IX Subsection "updateStageStatus"
Takes a stage name as the argument and updates its status. Called
automatically when needed and therefore has no place in userland.
.SS "execStage"
.IX Subsection "execStage"
Takes a stage name as the argument and executes that stage.
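.PP
The ordering guarantee stated under sortStages (a stage never depends on a
stage that follows it) is what a topological sort provides. The sketch below
shows one way to obtain such an order with a depth-first traversal. It is an
illustration only and makes no claim about how \s-1PMP\s0 sorts its stages;
the prerequisite table and the helper sort_stages are hypothetical.
.PP
.Vb 32
\& use strict;
\& use warnings;
\&
\& # Hypothetical prerequisite table: stage name => names it depends on.
\& my %prereqs = (
\&     total => [],
\&     final => ["total"],
\&     cls   => ["final"],
\&     mask  => ["total"],
\& );
\&
\& # Depth\-first ordering: a stage is emitted only after all of its
\& # prerequisites, so no stage depends on a stage that follows it.
\& sub sort_stages {
\&     my ($prereqs) = @_;
\&     my (%state, @order);
\&     my $visit;
\&     $visit = sub {
\&         my ($name) = @_;
\&         my $seen = $state{$name} // "";
\&         return if $seen eq "done";
\&         die "dependency cycle at $name\en" if $seen eq "busy";
\&         $state{$name} = "busy";
\&         $visit\->($_) for @{ $prereqs\->{$name} || [] };
\&         $state{$name} = "done";
\&         push @order, $name;
\&     };
\&     $visit\->($_) for sort keys %$prereqs;
\&     return @order;
\& }
\&
\& print join(" ", sort_stages(\e%prereqs)), "\en";
.Ve
.PP
The "busy" marker makes the sketch die on circular prerequisites instead of
recursing forever.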
.SS "execAllStages"
.IX Subsection "execAllStages"
Executes all stages in one lumped job.
.SS "getStatusBase"
.IX Subsection "getStatusBase"
Takes a stage name as an argument and returns the base for its status files.
.SS "getRunningFile"
.IX Subsection "getRunningFile"
Takes a stage name as an argument and returns the running filename for that
stage.
.SS "getFailedFile"
.IX Subsection "getFailedFile"
Takes a stage name as an argument and returns the failed filename for that
stage.
.SS "getFinishedFile"
.IX Subsection "getFinishedFile"
Takes a stage name as an argument and returns the finished filename for that
stage.
.SS "getLogFile"
.IX Subsection "getLogFile"
Takes a stage name as an argument and returns the log filename for that stage.
.SS "declareStageRunning"
.IX Subsection "declareStageRunning"
Takes a stage name as an argument and declares that stage to be running.
Touches the appropriate filename.
.SS "declareStageFailed"
.IX Subsection "declareStageFailed"
Same as above but for failure.
.SS "declareStageFinished"
.IX Subsection "declareStageFinished"
Same as above but for successful completion.
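.PP
The status file helpers above all build on the naming pattern
statusDir/pipelineName.stageName.suffix documented under statusDir. The
following sketch shows how such paths could be assembled and inspected from
outside \s-1PMP\s0, for example in a monitoring script. The helpers
status_base and stage_state are hypothetical and not part of \s-1PMP\s0, and
the order in which the suffixes are checked, as well as the "not started"
fallback for a single stage, are assumptions of this example.
.PP
.Vb 23
\& use strict;
\& use warnings;
\& use File::Spec;
\&
\& # Hypothetical helper: base path shared by all status files of a stage,
\& # following the statusDir/pipelineName.stageName pattern.
\& sub status_base {
\&     my ($status_dir, $pipeline, $stage) = @_;
\&     return File::Spec\->catfile($status_dir, "$pipeline.$stage");
\& }
\&
\& # Classify a stage by which status file currently exists.
\& sub stage_state {
\&     my ($base) = @_;
\&     for my $state (qw(finished failed running)) {
\&         return $state if \-e "$base.$state";
\&     }
\&     return "not started";
\& }
\&
\& my $base = status_base("/some/directory", "some\-name", "total");
\& print "$base.finished\en";
\& print stage_state($base), "\en";
.Ve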