# Release Notes for RTG Tools

Below are the release notes for the full RTG suite, upon which RTG Tools is based. Not all features described below may be included in this product.

RTG Core 3.8.3 (2017-08-02)
---------------------------

This release primarily includes bugfixes and minor improvements:

* rocplot: (GUI) Improvements to graph zooming, to allow stepping back to previous zoom levels as well as fully un-zooming.
* rocplot: Improve the automatic curve naming heuristic to ignore directory name suffixes like "-eval", ".vcfeval" etc, and similar prefixes.
* rocplot: Enable text antialiasing in GUI and PNG output.
* vcfeval: More graceful handling of input VCFs containing REF values that are not valid according to VCF specifications.
* vcfmerge/vcfeval: Normalize the casing of nucleotides in REF/ALT, which permits merging records where the REF/ALT differ in casing.
* vcffilter: Graceful error handling of a new category of invalid JavaScript expression.
* vcfsubset: Don't complain when using --keep-filter/--remove-filter flags with "PASS" and the VCF header doesn't contain a declaration for that filter.
* misc: Prevent a unit test failure when running on newer versions of Ubuntu.

Previous releases
=================

RTG Core 3.8.2 (2017-06-20)
---------------------------

This release primarily includes bugfixes and minor improvements:

* vcfeval: Records where the REF/ALT contain bases not permitted by the VCF specification are now skipped (and reported in the log) rather than terminating execution.
* vcfeval: (`combine` and `ga4gh` output modes only) These modes were inserting a redundant VCF header entry containing the command line, which has been removed.
* vcfeval: GA4GH output mode now supports loose positional matching of variants (within +/-30bp by default, and adjustable via --Xloose-match-distance).
* many: Prevent number formatting issues in non-English locales. The locale is now forced to US.
* many: Some commands were not appending gzip termination blocks to VCF outputs, which could result in subsequent warning messages being produced by some third party tools.
* many: Improve the consistency of exception handling in cases where the exception is thrown in a worker thread.
* many: Attempting to supply file lists via shell process redirection would fail in non-obvious ways. File lists from process redirection are not currently supported and are now checked for up-front.
* minor: When setting up rtg bash tab completion, issue a warning if an incompatible completion function has already been installed. (This can happen on some Linux distros if you have the system `bash-completion` package installed and attempt to tab-complete rtg before installing rtg bash completion.)
* minor: Fix a typo in the example configuration settings in rtg.cfg (specifically, RTG_JAVA_OPTS was incorrectly listed as RTG_JAVA_OPTIONS).

RTG Core 3.8.1 (2017-05-29)
---------------------------

This release primarily includes bugfixes and minor improvements:

* rocplot: (GUI) The right hand panel now includes a visual indication of the color for each curve.
* rocplot: (GUI) The color for a curve can now be set via a color picker available from the per-curve context menu.
* rocplot: (GUI) Reordering the curves is now achieved by drag and drop rather than the (now removed) reorder buttons.
* misc: The RTG Tools release includes a scripts/demo-tools.sh that gives a quick end-to-end demonstration of simulation and VCF manipulation commands. This is similar in nature to the scripts/demo-family.sh script that is included in RTG Core.
* vcfeval: Fix an exception caused by the skipping of heterozygous structural variants being dependent on the GT field allele ordering. These variants are now correctly skipped. In previous releases the cases that slipped through would enter matching with a stub allele representing the SV allele.
* vcfeval: When running a sample-free comparison via the option `--sample ALT`, ignore records/alleles corresponding to structural variants. In 3.8 these could produce an exception, and in previous releases any SV alleles present were included as a generic token during matching.
* vcfeval: Improve the handling of non-user exceptions encountered during VCF loading. Previously these would produce an often inscrutable message.
* version: Update copyright year and include an alternative citation more appropriate for those using RTG Tools.
* popsim: Now includes the random number seed in the VCF header for consistency with other simulation commands.

RTG Core 3.8 (2017-05-15)
-------------------------

Major features of this release:

* Improvements aimed at preprocessing and QC. In particular, RTG includes two new commands, fastqtrim and petrim, for preprocessing FASTQ files to apply various kinds of trimming before entering the NGS pipeline. These commands greatly expand what was previously available during data formatting.
* The suite of simulation commands that were previously only available as part of RTG Core have been included in the RTG Tools package. These commands encompass simulation of reference genomes (genomesim), simulation of population-level variants (popsim), individual sample genomes using population variants (samplesim), simulation of samples as members of a pedigree obeying inheritance rules (childsim), simulation of de-novo variants (denovosim), generation of a genome given a VCF of sample variants (samplereplay), and read simulation according to a range of sequencer parameters (readsim/cgsim).
* Initial support for accepting CRAM files as input to variant calling commands and most other commands that accept alignments as input. For some commands this may now require specifying a reference SDF in order to decode the CRAM files.
* Improvements to the prebuilt AVR models that perform variant scoring.
These models have been rebuilt using training data incorporating the latest truth sets produced by the GIAB initiative as well as improvements to the underlying machine learning algorithms.
* User manual improvements, in particular the baseline progressions section has been rearranged to better illustrate how to run end-to-end RTG calling pipelines that make best use of RTG features such as sex-aware and pedigree-aware variant calling.

Detailed changes are listed below by area. For more information on new features, see the RTG Operations Manual.

## Basic Formatting and Mapping

* fastqtrim: This new command allows trimming of FASTQ files with much more flexibility and control than is available directly from format. See the user manual for more information and examples.
* petrim: This new command allows trimming of read bases in paired-end data where read-through has occurred, as determined by alignment overlap. See the user manual for more information and examples.
* format: Support for reading interleaved paired-end FASTQ added. This is useful for formatting directly from streamed output of the petrim command, avoiding additional disk I/O.
* format/map: The quality encoding for FASTQ input files now defaults to the sanger encoding used by the majority of modern FASTQ files, and so the --quality-format flag typically only needs to be specified when processing older FASTQ files employing an alternative encoding.
* many: When outputting FASTA/FASTQ, ensure consistent use of Unix line endings across the various commands.
* calibrate: When calibrating multiple BAM files, each is calibrated in an independent thread, obeying the --threads flag.
* sammerge: New flag --subsample that permits a fraction of the alignments through to the output. In addition, the new flag --seed lets you control which seed is used for this filtering.
* coverage: Computes the additional QC metrics fold-80 penalty and median coverage.
* coverage: New flag --per-region, which changes how BED/BEDGRAPH coverage records are triggered, from being whenever the coverage level changes, to only when the region changes.
* sammerge: Will now create output files in CRAM format if the output filename ends with ".cram". This requires the user to specify the reference SDF via the new --template flag.
* index: Now allows creating indexes for CRAM files. These are the `.bai` indexes currently supported by htsjdk, rather than `.crai` indexes.

### Variant Calling

* snp: Includes INFO.DP annotations in output VCF, for consistency with the existing multi-sample caller output.
* family/population/somatic: New VCF annotations (OCOC/OCOF/DCOC/DCOF) that indicate the count/fraction of contrary evidence observed in the original (parent) vs derived (child) samples.
* snp/family/population/somatic: These commands now support SAM/BAM files that make use of the '=' character in the SEQ field (such as can be created by BamUtil:convert).
* snp/family/population/somatic: These commands now support CRAM files as input.
* family/population: Improved error reporting for semantically incorrect user-supplied pedigree information.
* snp/family/population/somatic: Improvements to the accuracy of the pre-built AVR models. These models have been rebuilt using training data incorporating the latest truth sets produced by the GIAB initiative as well as improvements to the underlying machine learning algorithm.
* snp/family/population: The default AVR model is now illumina-wgs.avr (previously the default was illumina-exome.avr). For exome calling, the illumina-exome.avr model provides an advantage over illumina-wgs.avr only when the primary interest is maximising the scoring of variants called outside of exome target regions.
* many: For compatibility with non-human species, sex handling of PAR regions has been extended to allow the PAR region in each member of an allosome pair to be of a different length.
* svprep: Add the ability to run on merged alignment files rather than requiring alignment files to be separated into mated vs unmated vs unmapped.
* svprep: New flag --no-augment permits the computation of read group statistics files only, for use when collecting statistics from third party alignment files.
* avrpredict: New flag --sample to allow AVR scoring of only the specified sample names.
* avrpredict: New flag --vcf-score-field to allow storing the AVR score into a format field with a different name, useful when comparing multiple scoring models.
* avrbuild: Improvements to the quality of models built in the presence of missing annotations.

### Variant Processing and Analysis

* vcfmerge: When combining records at the same position, vcfmerge will now not combine records at a site where some records use a VCF padding base (as required by the VCF specification to prevent REF or ALT being zero-length) and some records do not. This is because a record which utilizes a padding base is not making an assertion about the genotype of the padding base itself, and merging these records loses this semantic distinction. (The old behaviour can be obtained via --Xnon-padding-aware.)
* vcfannotate: New flag --no-header to suppress output of the VCF header.
* vcfsubset: New flag --remove-ids to allow clearing the ID column.
* rocplot: New flag --zoom which allows the specification of an initial zoom to display. See the user manual for a description of the coordinate syntax.
* rocplot: (GUI) Add ability to remove a curve via per-curve pop-up menu in the side-pane.
* rocplot: (GUI) Prevent loading the same ROC data file multiple times, and improve error handling on invalid files.
* rocplot: (GUI) Improvements to the open file dialog. Now defaults to displaying ROC data files only, permits opening multiple ROC data files at once via multi-select, and other minor changes.
* rocplot: (GUI) The "Cmd" button now shows the command in a pop-up dialog rather than sending it to the terminal, which eliminates the need to search through multiple tmux windows to find where rocplot was started from.
* many: Invalid VCF header contig length specifications are now reported gracefully.
* many: Improved error reporting of general VCF header parsing errors, now including the problematic line where possible.
* many: Improved error reporting of malformed GT fields.

### Metagenomics

* species: Fix the handling of mappings that contain non-unique read-names (as could arise when mapping directly from FASTQ files as separate mapping runs and passing the resulting alignments to species).
* species: Accuracy improvements when using paired-end data as the underlying data source.

### Other

* pedstats: Improved the GraphViz pedigree visualization layout for normal pedigree structures. The old layout is available with the new ``--simple-dot`` flag.
* many: The following simulation commands are now included as part of RTG Tools: genomesim, cgsim, readsim, popsim, samplesim, childsim, denovosim, samplereplay.
* readsim: When using --taxonomy-distribution and --distribution, one of --abundance or --dna-fraction must be supplied in order to indicate the desired interpretation.
* index: The -f flag is now optional and by default index will attempt to determine the file format by the extension.
* many: Most commands accept the advanced flag --Xforce that allows them to continue in the case of pre-existing output files or directories. Be aware that, particularly in the case of output directories, the final directory contents may include files from previous runs (or even other commands), so this option should not be used in production scenarios.
* many: Fixed an exception that could occur when performing multiple region based querying of SAM/BED/VCF records, where the regions were densely packed near the ends of chromosomes.
* many: Almost all commands that take SAM/BAM as input now support CRAM files as input. Some of these commands have a new flag used to supply the reference SDF which is required when decoding CRAM.
* misc: The rtg bash command completion has been improved to be more portable and no longer caches completion data on disk.
* many: Linux and Windows packages have updated the bundled JRE to the latest from Oracle.

RTG Core 3.7.1 (2016-10-18)
---------------------------

This release primarily includes bugfixes and minor improvements:

* map/cgmap: Addresses a pathological case where a particular paired-end read pair plus reference sequence could run for a disproportionately long time in highly repetitive regions.
* vcfeval: Fixes a rare exception that could occur when a "too-hard" region occurs right at the end of a reference sequence.
* rocplot: Fixes an exception that would occur when trying to plot the result of evaluating a call set (containing variants) against a baseline containing no variants.
* rocplot: (gui) When loading several files on startup, sometimes the initial view would not be fully zoomed out. We now ensure that the plot is zoomed out after the initial files are loaded.
* vcffilter: Fixes a regression in command line flag validation that would cause a talkback exception if no input file was supplied rather than presenting an appropriate message.
* vcfmerge: Fixes an exception that could occur when merging a mixture of regular VCFs containing sample columns with sites-only VCFs.
* bgzip: Fixes an exception that could occur when decompressing from stdin.
* Minor documentation fixes.

RTG Core 3.7 (2016-08-25)
-------------------------

Major features of this release:

* Improvements to mapping speed when aligning targeted sequencing data. This feature makes use of a per-reference hash blacklist which is constructed once per reference genome and can yield significant speed improvement.
In addition, several changes were made to reduce peak memory use during mapping.
* Variant callers now allow the optional inclusion of expected germline allele balance terms in the Bayesian model. On a genome-wide scale, this generally results in a reduction in false-positive calls, although sensitivity may be reduced for variants which do not follow allele balance expectations, such as mosaic de novo variants.
* Several improvements to the somatic caller. These include the ability to enable output of germline variants (due to the joint calling, accuracy of calling germline variants during somatic calling is typically higher than separately calling germline variants from the normal sample alone). The somatic caller now has the ability to explicitly model the expected somatic allelic fraction, for use in cases where the tumor heterogeneity is expected to be low. Additional options allow the output of records at sites exceeding user-specified thresholds for non-reference evidence. We have also included an AVR model specifically built for somatic calling which provides more accurate scoring than the regular germline AVR models.
* Several improvements to the variant comparison tools. vcfeval now includes the ability to evaluate matches across confident-region boundaries according to GA4GH recommended practice. vcfeval can be used to compare against "sample-free" VCFs such as ExAC/COSMIC/dbSNP, and the runtime has also been significantly improved. In addition, the rocplot command can now produce precision-sensitivity graphs, and can output SVG as a more publication-ready format.

Note: RTG now requires Java 8, so for those using the "nojre" RTG download or who are building from source, make sure you have Java 8 installed.

Detailed changes are listed below by area. For more information on new features, see the RTG Operations Manual.
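
As a rough illustration of what a precision-sensitivity graph plots, the following Python sketch computes the two quantities from true positive, false positive, and false negative counts using the standard definitions. The function name and the example counts are hypothetical, and vcfeval's own accounting (for example, how it weights different variant representations) may differ.

```python
def precision_sensitivity(tp, fp, fn):
    """Return (precision, sensitivity) from raw counts.

    Standard definitions, not RTG's implementation:
      precision   = TP / (TP + FP)
      sensitivity = TP / (TP + FN)
    """
    called = tp + fp  # everything the call set asserted
    truth = tp + fn   # everything present in the baseline
    precision = tp / called if called else 0.0
    sensitivity = tp / truth if truth else 0.0
    return precision, sensitivity


# Hypothetical counts, for illustration only.
p, s = precision_sensitivity(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} sensitivity={s:.3f}")
```

Each point on such a graph corresponds to one score threshold, with the counts recomputed for the calls that pass that threshold.
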
## Basic Formatting and Mapping

* format: Automatically installs reference genome configuration information when a recognized reference genome is being formatted to SDF. Also outputs a reminder for those cases where it looks like a reference genome is being formatted but which is not one of the recognized genomes.
* sdf2cg: New command to allow the export of Complete Genomics data that has been formatted as SDF to Complete Genomics TSV read format.
* map/cgmap: TLEN was not being correctly computed in the presence of soft clipping and back steps. This has now been corrected.
* map/cgmap: Several reductions in peak memory use during mapping.
* map: Significant speed improvement when mapping highly targeted sequencing data, using the mechanism of a repetitive hash blacklist. This is enabled via the new flag --reference-blacklist. A separate tool 'hashdist' is used for this one-off blacklist construction.
* hashdist: New command that can be used to analyse the uniqueness of k-mers contained within a reference sequence and to produce a reference hash blacklist.
* calibrate: New flags --exclude-bed and --exclude-vcf can be used to exclude sites of known genomic variation during the computation of calibration data. It is not currently possible to specify this information to the automatic calibration that is carried out during mapping; this will be added in a future release.

### Variant Calling

* snp/family/population/somatic: These callers expect calling to be carried out on alignments that have had calibration information computed. They now require the explicit use of the --no-calibration flag in order to proceed without it.
* snp/family/population/somatic: These commands now output a warning if too many "excessive coverage" situations are encountered, as this usually signifies that the user has incorrectly calibrated their mappings or has failed to supply an appropriate coverage parameter to the caller.
In addition, these commands output a warning if it appears that calibration has not been computed from correct regions for targeted data.
* snp/family/population/somatic: New flag --min-base-quality which allows explicit ignoring of base calls which do not meet the specified minimum phred quality score. These bases will be treated the same as an N and will not contribute to allele counts. The default is to consider all bases.
* family/population/somatic: The semantics of --max-coverage have changed from being the total coverage across all samples, to being the average per-sample coverage. This flag is typically only used when running without calibration, and this change makes the default behaviour more scalable with varying numbers of samples.
* snp: An explicitly specified --ploidy flag now overrides the ploidy obtained from reference genome configuration (if present). Previously the ploidy specified in the reference genome would take precedence.
* snp/family/population/somatic: Fixed an incorrect (and sometimes non-deterministic) computation of the PUR FORMAT annotation. This does not affect primary calling but could result in changes in AVR score.
* snp/family/population/somatic: Updated the Bayesian model to include a term for the expected allele balance. This is disabled by default, and can be enabled with the new flag --enable-allelic-fraction. This option gives improved precision for regular germline calling, but sensitivity to mosaic variants or those within CNV regions may be reduced.
* snp/somatic: The new flags --min-variant-allelic-depth and --min-variant-allelic-fraction can be used to enable output at sites where these thresholds are met, even if the caller would not otherwise make a call. Note that this does not act as a filter to prevent the caller from producing output at sites where these thresholds are not met.
* somatic: New flag --include-germline which instructs the somatic caller to also output variants which have been identified as germline variants.
* somatic: New flag --enable-somatic-allelic-fraction which instructs the Bayesian model to include a term for the expected somatic allelic fraction in the calling. This flag is most appropriate when tumor heterogeneity is low.
* somatic: A new pre-built AVR model is provided for somatic calling which provides better scoring for somatic variants than the regular AVR models. This new model, "illumina-somatic.avr" is selected by default by the somatic caller.

### Variant Processing and Analysis

* vcfsubset/vcffilter: New flag --no-header which omits the output of the VCF header.
* vcffilter: New option --keep-expr to allow filtering records based on simple JavaScript expressions with natural VCF field access. For example 'NA12878.DP > NA12892.DP' to select records from a trio call-set where the depth of NA12878 is greater than that of her mother. See the user manual for more information and examples.
* vcffilter: New option --javascript to allow advanced filtering and other processing of the VCF file using powerful JavaScript filters. These scripts can contain initial setup, per-record actions, and end functions. See the user manual for more information and examples.
* vcfeval: Specifying a sample name of ALT for either the baseline or call sample name instructs vcfeval to match against all possible non-ref diploid (or haploid if using --squash-ploidy) genotypes possible from the declared ALTs. This permits matching against a VCF that contains no sample column, for example to find hits against a sample-free VCF such as ExAC or COSMIC.
* vcfeval: New flag --evaluation-regions, which adds support for matching across high-confidence/false-positive regions such as those supplied with GIAB or Illumina Platinum Genomes truth sets according to GA4GH recommendations. In summary, only matches against baseline variants within these regions count as true positives and only non-matched call variants made within these regions count as false positives.
* vcfeval: Now outputs additional true positive statistics for the unweighted calls, so you can see the simple count of true positives in call representation. When computing precision, this uses the unweighted call count in the denominator, to reduce representation bias in the precision.
* vcfeval: Significant speed increase (often 2x speed up for typical WGS comparisons).
* vcfeval: New output mode 'roc-only' which skips the output of VCF files and only produces the ROC data files and summary metrics. This reduces run-time and the size of the output directories when doing many runs.
* vcfeval: Command line score field specification permits INFO. form, for consistency with JavaScript expression notation, although the old form of INFO= is still supported.
* rocplot: Added the ability to plot precision-sensitivity graphs via the new flag --precision-sensitivity. In the interactive GUI the graph type can also be changed on the fly via a dropdown chooser.
* rocplot: Added the ability to output images in SVG format, both in non-interactive mode via the new flag --svg, and when saving images from the interactive GUI.
* rocplot: Improved the default labelling of curves by including the score field if available.
* rocplot: The curve palette size has been increased in order to allow easier differentiation when more than 8 curves are being displayed at once.
* rocplot: (GUI) Fixed an annoying bug that could occur when trying to edit the title of the plot or of the curves. Several other minor GUI improvements have been made, such as the ability to use the mouse-wheel to scroll large lists of curves.

### Other

* aview: Now defaults to showing base colors in the terminal. Use --no-base-colors to disable this.
* aview: Better error handling for invalid SAM records.
* aview: New flag --print-soft-clipped-bases to display soft-clipped bases.
* chrstats: New flag --output-pedigree that can be used to create a default pedigree file based on the mappings of multiple samples, using inferred sample sex where possible.
* many: In several cases where a flag could be specified multiple times, it is now possible to supply a comma separated list of values. These are indicated in the output of --help.
* many: Most utility commands which write VCF files now do so asynchronously, often resulting in significant speed improvements.
* all: The distribution now includes an HTML version of the operations manual in addition to the PDF version.
* all: The minimum Java requirement for RTG is now Java 8.

RTG Core 3.6.2 (2016-03-10)
---------------------------

This release primarily includes bugfixes and minor improvements:

* map: Mapping very large numbers of reads in a single chunk or with low step size settings could exceed the capacity of some internal data structures, giving unpredictable results. An explicit check for these conditions has been added.
* map: Reduction in peak memory use when mapping paired-end data.
* vcfeval: Better error handling for variants which have triploid or higher GT (ploidy higher than 2 is not supported).
* extract: Extracting multiple regions from SAM/BAM across different chromosomes could cause an exception.
* rocplot: Improved error handling for yet more ways in which attempting to open a GUI from a headless server can fail.
* rocplot: (GUI) Minor improvement to crosshair handling.

RTG Core 3.6.1 (2016-01-25)
---------------------------

This release primarily includes bugfixes and minor improvements:

* coverage: Fixed an exception that could occur when supplying a reference SDF that did not contain all the sequences present in the alignments.
* family: Fixed an exception that could occur when supplying a family pedigree involving members not present in the input mappings.
* population: The COF/COC annotations for de novo calls that were recently added to the family caller are now also produced by the population command when appropriate.
* map/cgmap: When mapping pre-formatted reads containing SAM read group information embedded in the SDF and the input format was explicitly specified as SDF via -F sdf, the read group info wasn't being picked up. This is now fixed.
* vcfmerge: Speed improvement when merging VCF files containing a large number of contig header declarations.
* many: Speed improvement when accessing indexed datafiles (e.g. BED/BAM/VCF) that were being filtered by very large sets of regions.
* rocplot: Better error handling when trying to run the GUI on a machine where a graphics environment is unavailable.
* rocplot: (GUI) Update frame title when graph title changes.

RTG Core 3.6 (2015-12-07)
-------------------------

Major features of this release:

* Further improvements to somatic variant calling which significantly reduce the number of false positive calls while retaining somatic calling sensitivity. These improvements are achieved by incorporating the presence of somatic-allele-supporting evidence in the normal into the Bayesian computation. Additional VCF annotations quantifying these "contrary observations" are included in the output.
* De novo variant detection in families and pedigrees now incorporates similar techniques for a reduction in false positives.
* Our support for aligning and variant calling with reads produced by Complete Genomics Inc has been extended to their newer 29 base-pair read structure (these reads consisting of 10-9-10 sub-reads are often represented as 30 base-pairs with a redundant N).
* Several improvements to variant comparison with vcfeval, including the improved handling of call sets containing overlapping variants, and the ability to select alternative output modes depending on the desired analysis workflow.

Detailed changes are listed below by area.
Please read these through fully, as some command-line flags have changed, so updates to your pipeline scripts may be required. For more information on new features, see the RTG Operations Manual.

## Basic Formatting and Mapping

* cg2sdf: Add support for formatting CGI TSV reads files containing their version 2 reads. These reads are typically represented as 30 base-pair arms (10-10-10 subread structure containing a redundant N which is removed during formatting), although 29 base-pair arm representation (10-9-10 subread structure) is also supported.
* sdf2cg: This new command allows exporting SDF formatted Complete Genomics read data to their TSV reads file format.
* cgmap: Now supports aligning the version 2 read structure. When aligning CGI reads, an indexing mask appropriate for the type of reads being mapped must be selected, so --mask is now a required flag.
* cgmap: Mask names have been changed to more clearly indicate which version of CGI reads they are applicable to. Available masks are now "cg1" (formerly named "cgmaska15b1"), "cg1-fast" (formerly named "cgmaska1b1"), and "cg2" (a new mask for use with version 2 reads which is roughly equivalent in sensitivity to "cg1-fast"). Additional masks may be available in future.

### Variant Calling

* somatic: Features an improvement to the Bayesian calculation to better account for the presence of contrary evidence. This has resulted in a large reduction in false positives while maintaining sensitivity.
* population/family: These pedigree-based callers now contain similar adjustments to the Bayesian calculation to better account for contrary evidence of de novo variants. This has resulted in a large reduction in false positive de novos while maintaining sensitivity.
* somatic/family/population: These callers produce additional annotations in their output VCF that indicate the degree of contrary observations for the novel allele.
The COC annotation contains a simple count of the number of contrary observations and the COF annotation contains the contrary observations as a fraction of total observations. Users who wish to adjust the sensitivity/precision tradeoff of their de novo call sets may wish to use these attributes for filtering.
* family/population: The marking of equivalent complex calls was not functioning for sex-aware calling on the Y chromosome when both males and females are present, resulting in occasional additional equivalent but differently represented variants present in the output.
* population: Better error handling when the user supplies a pedigree that contains cycles.
* avrbuild: The new COC and COF annotations are now available as derived annotations that can be used in model building. One interesting use of these attributes may be to build AVR models specifically for predicting the correctness of de novo predictions.
* snp/family/population/somatic: These variant callers all now include support for CGI 29 base-pair read structure.
* snp/family/population/somatic: The pre-built AVR models distributed with RTG have all been rebuilt using current annotations and updated training data.

### Variant Processing and Analysis

* vcfannotate: New option --relabel allows sample names in a VCF to be changed.
* vcfsubset: New flag --remove-qual to reset the QUAL field to '.'
* vcfsubset: Fixed a bug where encountering a VCF record that did not contain any FORMAT field specified in --keep-format would cause all subsequent records to be dropped.
* vcffilter: For convenience the existing flags --keep-format, --remove-format, --keep-samples, etc. now support comma separated lists. For example: --keep-format GQ,AVR.
* vcffilter: New flag --remove-hom to exclude records where a sample was called as homozygous.
* vcfeval: New additional output modes that allow the selection of output files that best suit the desired workflow.
These are controlled via --output-mode flag and there are currently three options available: split (the default, equivalent to previous behaviour), annotate (outputs baseline and calls files augmented with match status annotations), and combine (provides a simple side-by-side two-column VCF). For more information, see the user manual. * vcfeval: Removed option --baseline-tp, as the output of the baseline version of true positive variants is now always performed. When using the default (split) output mode, these are output to tp-baseline.vcf as before. * vcfeval: Added the ability to detect those FP and FN which have common alleles (e.g.: zygosity errors). Previously this could be done manually by running vcfeval a second time using --squash-ploidy on the fp.vcf and fn.vcf of an initial comparison, but now it is automatically performed when running the new annotate or combine output modes. * vcfeval: New flag --ref-overlap to allow matching variants where the alleles would overlap as long as the overlap bases are the same as ref. Unambiguous VCFs should not need this option, but such cases can arise when using unsophisticated callers or VCF merging tools. * vcfeval: Weighted ROC files now include a final data row that includes the statistics corresponding to no threshold application (and this includes any variants that were processed during path finding but which do not contain any ROC score field). In an ROC plot, this final point may be visible as a "tick" at the end of the curve. * vcfeval: The set of ROC data files that are produced are now for the following three subsets of calls: all calls, snps only, and non-snps only (e.g. indels, MNPs). Some users were doing separate runs of vcfeval on input sets filtered by category in order to get separate statistics for snps vs indels, an approach which is prone to misclassification of complex variants. 
* vcfeval: When processing multi-sample VCF files, it is now possible to specify different sample names for baseline vs callset, via the form: --sample baseline_sample,calls_sample.
* vcfeval: Fixed a rare bug where if the input VCFs contained multiple variants with the same reference position and length, the output VCFs could contain the incorrect variant.
* vcfeval: Fixed a crash that could occur when the input set contained a variant that extended off the end of the reference sequence.
* rocplot: (GUI) Fix several minor issues: initial paint was not laid out correctly; very small ROC files would not display status info; some UI layout improvements; and add a small amount of display padding.
* rocplot: (GUI) Malformed ROC data files now show an error dialog.

### Metagenomics

* similarity: This tool will now make use of available taxonomy information in the case of a single supplied SDF, in order to allow the easy computation of a neighbour joining tree from a reference species database (or subset thereof).

### Other

* sdf2fasta/sdf2fastq: New flag --interleave to permit output of paired end data to a single output in interleaved fashion (i.e. alternating left and right arms). This allows piping paired end data for simple command-line processing (although there is also sdf2sam which may be more applicable depending on the processing desired).
* cgsim: Added support for simulating reads with the CGI version 2 read structure, controlled via a new flag, --cg-read-version.
* readsim: Add support for both versions of CGI read structures. Use --machine complete_genomics (the original 35 base pair read structure) or --machine complete_genomics_2 (the newer 29 base pair structure).
* aview: New flag --unflatten to display unflattened CGI reads when present. At present only version 1 reads can be displayed in unflattened form.
* misc: bash completion for RTG commands and options now works on Mac OS X (see scripts/rtg-bash-completion for instructions).
* misc: The underlying htsjdk library used for SAM/BAM support has been updated to version 1.141.
* many: The JRE bundled with Linux/Windows builds is now 1.8.

RTG Core 3.5.2 (2015-10-15)
---------------------------

This release primarily includes bugfixes and minor improvements:

* many: When piping results from one command to another, and a later command closes the pipe (e.g. head), this scenario no longer produces a "Broken pipe" error message. This is consistent with the behaviour of commonly used command-line tools.
* rocplot: Updated to handle ROC data files that contain lines with non-numeric score field. (In particular, future versions of vcfeval will include additional data-points corresponding to variants with no score provided.)
* rocplot: (GUI) Improvement to usability for curve renaming. Now a single-click in the curve title area enters edit mode, with RETURN/TAB to accept, ESC to cancel.
* rocplot: (GUI) Add a button that prints an equivalent command line to the terminal, for easy restarting with similar state, particularly if curves files have been added interactively.
* cgmap: Fix for sample sex being ignored when supplied via a pedigree file rather than using explicit sex flag.
* misc: Removed vestigial (and in RTG Tools' case, incorrect) "Licensed to:" line from the version command output.
* misc: Add BSD license text to the RTG Tools distributable zip.

RTG Core 3.5.1 (2015-09-07)
---------------------------

This release primarily includes bugfixes and minor improvements:

* coverage: Fix an exception that could occur if running with a reference SDF supplied that had chromosomes in a different order compared to the BAM sequence dictionary (typically this could occur when running coverage on third-party BAMs).
* extract: When extracting multiple regions these regions are now sorted.
* vcfeval: When an entire chromosome contained only baseline or only called variants, the summary statistics for FP/FN were not being incremented correctly.
* vcfeval: Fixed a case where path-finding could get confused and drop variants. * vcfeval: Speed improvement in post-processing. * many: Improved error reporting for commands that involve processing multiple BAM files, so that the name of the particular file causing the problem is included. * wrapper: Fixed the java version number check so that it works correctly with openjdk 1.8 RTG Core 3.5 (2015-07-16) ------------------------- Major features of this release: * Several improvements to somatic calling, including the ability to specify site-specific somatic priors, control of output for gain-of-reference and loss-of-heterozygosity events, and changes to the VCF according to TCGA VCF specification: https://wiki.nci.nih.gov/display/TCGA/TCGA+Variant+Call+Format+%28VCF%29+1.2+Specification+-+Unofficial Note that these changes in VCF format compared to previous versions may require users to update their existing scripts for the changes. * Improvements to variant evaluation with vcfeval, primarily the ability to perform evaluation restricted to individual regions or sets of regions (for example GiaB high-confidence intervals or exome target regions), as well as the inclusion of more accuracy metrics, both as a new summary file and included in the weighted ROC data file. * Improvements to metagenomic species reference database management. Several new options allow better customization of a species reference, and extraction of genomic information for individual species contained within the reference database. Detailed changes are listed below by area. Please read these through fully, as some command-line flags have changed, so updates to your pipeline scripts may be required. For more information on new features, see the RTG Operations Manual. ### Basic Formatting and Mapping * format/map: When formatting or mapping reads supplied as SAM/BAM input data, any alignments marked as supplementary are ignored. 
Note that if the input data has already been aligned, it is recommended that the BAM file be shuffled to avoid biases during mapping arising from the data being presented in chromosomal order. See the user manual for more information.
* sdf2fasta/sdf2fastq: These commands have new flags --names and --id-file that operate the same as their counterpart in sdfsubset.
* sdfsubset: This command has new flags --start-id and --end-id that allow specifying a range of sequences by ID.
* sdf2sam: This new command allows the extraction of reads from SDF in the form of unaligned SAM/BAM. This has a benefit over extraction as FASTQ in that some metadata (such as read group information) is preserved, paired end data is stored in a single file, and quality encoding is inherent in the format.
* chrstats: Reduce false positives in sex inconsistency detection that were due to applying the (tighter) sex-chromosome threshold also to autosomes. This threshold is now applied to sex-chromosomes only.

### Variant Calling and Analysis

* somatic: Now allows the user to specify a BED file containing per-site somatic priors, which can be used (for example) to reduce the somatic prior at sites typical of false positives (e.g. presence in dbSNP) or increase the somatic prior at sites known to harbour somatic variants (e.g. presence in COSMIC). For more information see the user manual.
* somatic: At the end of variant calling, the somatic caller produces an estimate of somatic sample contamination. Previously this estimate was only available in the log file, but in this release this computation has been greatly improved, and the contamination estimate is now included in the standard summary statistics.
* somatic: "Gain of reference" calls are now disabled by default. These can be included by specifying the new flag --include-gain-of-reference.
* somatic: Calls that are indicative of loss of heterozygosity (LOH) are not produced by default (since loss of heterozygosity analysis is most useful in conjunction with additional data such as germline variant calls or CNV data). These calls can be produced if desired by specifying --loh with a prior greater than 0.
* somatic: LOH calls, when enabled, were previously output in haploid GT representation. They now use the ploidy appropriate for the chromosome (according to the reference), for compatibility with downstream processing tools.
* somatic: VCF output changes to bring the somatic representation in line with TCGA 1.2 VCF specification. In particular:
  * Calls include a new FORMAT field SS that indicates the somatic status for the derived (tumor) sample. This field replaces the previous SOMATIC INFO field.
  * Calls include a new FORMAT field SSC which contains the somatic score for the derived (tumor) sample. This field replaces the previous RSS INFO field.
* lineage: Supports the input of pedigree in the form of VCF header annotations as output by the somatic caller, in the form: ##PEDIGREE=
* population: Fixed a rare case where sometimes after complex call simplification, the only sample genotype containing a non-ref allele was a member of the pedigree not being output, and in this case the QUAL score was the 10log10 prob(no variant) rather than 10log10 prob(variant) as required by the VCF specification. This has been addressed.
* vcfmerge: Added a new flag --force-merge-all to always attempt to merge headers containing conflicting descriptions.
* vcfmerge: Previously vcfmerge would not process records containing symbolic alleles. These are now accepted.
* vcfmerge: More graceful handling when encountering records with a GT that refers to a non-existent ALT.
* vcfeval: Now outputs a summary containing various accuracy metrics.
A first set of statistics is computed from the full set of variants evaluated (these will typically have highest sensitivity but potentially poor precision if the input call set has not been filtered). A second set of statistics is computed based on the ROC curve information, selected at a threshold which maximises the F-measure statistic (this provides some balance between sensitivity and precision, so may be a fairer point to gather statistics for cross-caller comparison). * vcfeval: The weighted_roc.tsv file now includes columns containing additional accuracy metrics. * vcfeval: Improved the detection that alerts the user when chromosome names are incompatible between reference, baseline, calls, and bed regions (if used). Improvements to other error and warning messages. * vcfeval: Added a new flag --bed-regions to supply a BED file containing a list of regions that the VCF records must overlap with in order to be included in analysis. For example, a common use case is to restrict to only evaluating calls contained within the GIAB high-confidence regions, or only within regions corresponding to exome target regions. * vcfeval: Added a new flag --region to specify a single region to evaluate variants within. This is useful when evaluating calls on a single chromosome or within a small region of interest. * vcfeval: Fixed a case where a ref-only call (i.e. containing no alts) could get output instead of an indel with a padding base at the same position. * vcfeval: Disabled the output of slope analysis data files by default, as these are fairly special purpose (primary ROC files are still output). They can be re-enabled if desired by using the new expert/experimental flag --Xslope-files. * vcffilter: The --remove-all-same-as-ref flag now does not consider a sample with missing GT as being variant, since the intent of this flag is to retain only records where at least one sample is called as variant. 
* vcfannotate: Added two new flags --info-id and --info-description to allow specifying the name of the INFO ID and Description fields added to the header during annotation. These flags only take effect if the VCF header does not already contain an INFO declaration with that ID. ### Metagenomics * taxfilter: Added a new flag --subtree which allows selecting entire taxonomic subtrees for inclusion in the output taxonomy. * taxfilter: Added a new flag --remove-sequences to allow the removal of sequence data associated with specific taxon ids. * sdf2fasta: Added a new flag --taxons to allow interpreting any supplied ID as a taxon ID and all sequences assigned to such taxon ID will be output. This provides an easy way to extract genomic sequence for any species from the reference SDF. ### Other * genomesim: Added a new flag --prefix to specify a prefix for generated sequence names. * many: Update the base library used for SAM/BAM input and output to htsjdk 1.128. * many: VCF reading now detects cases where a header specifies a field declaration using an ID that is already in use, preventing duplicate header declarations. * extract: Fix a regression where extracting from VCF without any region specified would include the VCF header. RTG Core 3.4.5 (2015-05-22) --------------------------- This release primarily includes bugfixes and minor improvements: * somatic: If the input mappings contained unmapped records with assigned coordinates, these were erroneously being included as evidence, resulting in spurious calls when calling with non-zero contamination specified. * vcfeval: Implemented an algorithm optimization that permits the evaluation of situations that previous versions would skip over as being too-complex (primarily where there were long runs of abutting variants), as well as yielding a general speed improvement. * avrbuild: Add checks that the user has specified at least one VCF annotation for use as a predictor attribute. 
* vcffilter: Fix bug when filtering on the FILTER declared last in the header for files that contained inadvertent duplicate FILTER header declarations or containing an explicit declaration for the PASS filter. * rocplot: Minor improvements to file chooser handling, and also include F-measure as an additional accuracy statistic in the status bar. RTG Core 3.4.4 (2015-04-20) --------------------------- This release primarily includes bugfixes and minor improvements: * vcffilter: The --keep-filter and --remove-filter options now recognize '.' as a value that can be filtered on. For example, to keep only variants that have a FILTER column that corresponds to non-filtered, use -k . -k PASS. * vcfeval: Enabled skipping over more extremely complex edge cases that could otherwise cause exceedingly long computation times. * rocplot: Add the ability to click on a point within the graph to show in the status bar the true positives / false positives / precision / sensitivity scores equivalent to that point. * rocplot: The individual curve sliders that can be used to simulate the effects of various threshold cut-offs did not work very well for curves corresponding to scores with very wide ranges and non-uniform distribution (such as GQ and QUAL often are). These sliders are improved so they work better with these curves, and adjusting the sliders also displays accuracy metrics in the status bar, to aid in threshold selection. * rocplot: It was sometimes possible to zoom in to negative coordinates. * aview: Fix display of BED regions that do not have a region name contained within the BED file. RTG Core 3.4.3 (2015-03-19) --------------------------- This is primarily a bugfix release: * map: Fixed a crash that could occur when mapping without any sample sex specified but when using a reference genome containing chromosome sex information. 
* somatic: Fixed a rare crash that could occur when calling across blocks of Ns when the only hypothesis presented by the reads was a deletion of sufficient length. * vcfeval: Improved handling in situations where variants are so dense within a region that there are too many possible haplotypes to feasibly resolve. Previously operation would abort, now a warning is issued and both baseline and called variants within that region are ignored. RTG Core 3.4.2 (2015-03-02) --------------------------- This is primarily a bugfix release: * somatic: Fix a crash that could occur when calling across Ns in the reference. * snp/family/population/somatic: Under some circumstances, I/O exceptions could trigger a crash talkback rather than being presented as a regular user-level error message for the user to act on. * chrstats: Fixed inconsistent output destinations between single-sample vs multiple sample case, and do not create a log file for this command in the current directory. * chrstats: Detect when the user has not set up a reference SDF with chromosome specification information and provide an appropriate error message indicating how to correct the situation. * many: Improved error handling when requesting indexed region retrieval of BED/VCF/SAM files for coordinates outside the range that can be addressed by tabix/bam indexes. * many: Improved error handling when errors are encountered during VCF header parsing, providing more information on where the problem was. * many: Improved error handling when errors are encountered during tabix indexing. RTG Core 3.4.1 (2015-01-22) --------------------------- This is primarily a bugfix release: * snp/family/population: Fixed a crash that could occur when calling across blocks of Ns when the only hypothesis presented by the reads was a deletion of sufficient length. * snp/family/population: When calling across blocks of Ns, under some circumstances no variant call would be made. 
* snp/family/population: Extremely large GQ and DNP FORMAT values are now capped at the maximum permitted by BCF (2147483647). Previously, values above this could occasionally trigger a crash. * wrapper: Changes to streamline the first run configuration and to bring Unix and Windows wrappers closer to equivalence, including clearer instructions of how to customize initial configuration. Crash reporting is now opt-out rather than opt-in. * unix wrapper: When the operating system fails to allocate memory to the JVM (typically due to other memory-intensive processes running on the same machine) this is now presented as a user message, rather than triggering a crash report talkback. * many: input list files are now validated during loading rather than after loading the list. This gives much better error handling in the case where a user accidentally gives the name of an alignment file as an input list file. * Other improvements and cleanups to documentation. RTG Core 3.4 (2014-12-20) ------------------------- Major features of this release: * Added the ability to run variant calling only on a list of regions provided via BED file. This results in a large speed improvement when performing exome variant calling, by avoiding computation associated with off-target locations, as well as permitting fast variant calling of target sites from whole genome data, or running variant calling in haploid mode in areas of loss-of-heterozygosity. * Added the ability to perform variant calling for sites where the reference is unknown but where reads have been mapped. This can be used to fill in gaps in draft reference assemblies. This includes both sites where an N is observed in the reference, larger N-blocks where reads have been mapped spanning the N block, and large N-blocks where reads are anchored on one side by known reference. * Workflow improvements to human pipeline processing to identify mislabelled samples or incorrect pedigree. 
At the end of read mapping, average coverage levels across chromosomes are examined and a warning is issued if there appear to be gross chromosomal abnormalities or if the coverage levels do not match expected levels for the sex of the individual specified. A standalone tool for this is also provided. Similarly, the mendelian analysis tool now computes concordance with pedigree and issues a warning if low concordance indicates a parent or child is inconsistent with the supplied pedigree. In addition we have added two commands for manipulating, extracting information from, and summarizing pedigree files. * New commands for metagenomics taxonomy and reference database management. Previously using metagenomics databases other than those pre-built by RTG was difficult and error-prone. Three commands have been added to allow taxonomy construction starting from a NCBI taxonomy dump, filtering the taxonomy based on user criteria, and validating the structure of a metagenomics species reference database. Detailed changes are listed below by area. Please read these through fully, as some command-line flags have changed, so updates to your pipeline scripts may be required. For more information on new features, see the RTG Operations Manual. ### Basic Formatting and Mapping * map/cgmap/mapf: As an alternative to supplying --sex to specify the sex of the individual being mapped, you may specify a pedigree file containing the sex information for the sample. This requires you to have either formatted the read set with read-group information or to supply read group information at mapping time (the advantage of this feature is that it lets you minimize the number of command-line differences for each sample being mapped). 
* map/cgmap: When mapping using a reference containing sex chromosome information, average per-chromosome coverage information is used to issue warnings when it is likely that the incorrect mapping sex has been specified or if any autosomes have abnormal coverage levels (perhaps indicating a chromosomal abnormality). This feature requires you to be using a reference genome SDF containing chromosome information, as described in the RTG Operations Manual. * chrstats: New command to perform standalone average coverage reporting and checking against expected coverage levels from calibrated mapping files. This is essentially the same check that is performed during mapping, but allows multiple mapping files to be provided (either if multiple mapping runs were performed for a single sample, or for batch reporting for multiple samples). * calibrate: New option --merge to allow merging multiple alignment files into a single output file while performing calibration. For example, this can reduce the number of I/O operations needed to go from multiple, uncalibrated, unindexed third party input files to a single calibrated indexed BAM file. * calibrate: New option --threads to allow calibration of multiple files to use multiple cores. (Currently this option only takes effect when used with the --merge option, not regular multi-file calibration) ### Variant Calling * snp/family/population/somatic: New flag --bed-regions, adds the ability to only perform calling on the regions specified via a BED file. This is more efficient than applying BED filtering via --filter-bed. However note that the results can sometimes differ, due to edge effects of complex calling regions that cross region boundaries. * snp/family/population/somatic: Implemented variant calling across N's in the reference. (This was previously occurring in some cases where mappings across the N contain indels, but has now been fully implemented). 
Calls where the reference is not a valid allele due to containing an N are annotated with an NREF INFO tag for easy filtering, and contain neither QUAL nor GL values.
* snp: As an alternative to supplying --sex to specify the sex of the individual for variant calling, you may specify a pedigree file containing the sex information for the sample. This can reduce the number of command-line differences when processing multiple samples.
* family/population/somatic: Better error handling when input mappings contain a record that does not correspond to one of the samples being called.
* snp/family/population/somatic: Fixed a hang that could occur when trying to clean up after an out-of-memory error.
* snp/family/population/somatic: Fixed a rare crash that could occur at the end of chromosomes.
* somatic: Previously stored a somatic score indicating the likelihood of the variant being a somatic variant in the QUAL field. This is not strictly according to the VCF spec, so this score has been moved to the new NCS INFO field.
* vcfannotate: The --fill-ac-an flag now does not add an AC annotation when no ALTs are present in a record.
* vcffilter: New flag --region to extract and filter only the variants contained within a single specified region.
* vcffilter: New flag --bed-regions to extract and filter only variants contained within the regions contained in a BED file.
* vcffilter: Better error handling when applying criteria that require GT be present to files that are missing the GT field.
* vcfmerge: The default behaviour has changed when merging variants at the same position where the ALTs are different and the variants contain FORMAT fields that cannot be automatically merged (Number=A,G,R, or the special case of the AD FORMAT field). Now these FORMAT fields are removed to allow the merge to proceed. There is a new flag --preserve-formats to instead output separate variants that keep those FORMAT fields.
* vcfeval: New flag --baseline-tp that allows additionally outputting the baseline version of true positive variants (the regular tp.vcf contains the called representation of true positive variants).
* vcfeval: --squash-ploidy treats heterozygous calls in baseline and calls as homozygous ALT to allow a lenient comparison. Note that genotypes at multi-allelic sites where neither allele is REF simply choose the ALT with the highest index.
* vcfeval: Fixed an exception that could occur when processing variants missing GT information for some samples.
* vcfeval: Fixed an exception that could occur when provided variants that were outside the bounds of the supplied reference genome.
* vcfeval: Fixed an inconsistency when handling ROC files in locales where ',' is the decimal separator.
* mendelian: The default is now to perform checks only on non-failing variants. The --pass flag has been removed, and a new flag --all-records added in order to obtain the behaviour of checking all variant records regardless of filters.
* mendelian: Now performs concordance checking to detect sample mislabelling and incorrect pedigree.
* mendelian: Removed --male and --female flag, which were only needed for VCFs produced by versions of RTG prior to 2.7. If required, alternative pedigree information can be supplied via the --pedigree flag.

### Metagenomics

* ncbi2tax: New tool to generate an RTG taxonomy file from NCBI taxonomy dump.
* taxfilter: New tool for the custom filtering of taxonomy files and metagenomic reference SDFs containing taxonomy information.
* taxstats: New tool for verifying the contents of a metagenomic reference SDF.

### Other

* sdfsubseq: The output sequence name is the same as the input sequence if the coordinates are unchanged.
* many: Added the ability to read BED from stdin by specifying '-' as the BED file name (this is not supported in cases where a region restriction is also being applied to the file, as this would require the BED to be tabix indexed).
* many: Added the ability to read VCF from stdin by specifying '-' as the VCF file name (not supported in cases where a region restriction is also being applied to the file, as this would require the VCF to be tabix indexed).
* many: Users of linux bash can enable command and flag completion. See the file rtg-bash-completion in the scripts directory for more information.
* bgzip: New flag --no-terminate allows the omission of the block gzip termination block. This permits advanced users to compress multiple files for later fast concatenation (the termination block should be present on the final file only).
* bgzip: New flag --compression-level allows altering the degree of compression (thus speed) from 1 (least but fast) to 9 (best but slow).
* rocplot: GUI mode has better error handling when there is no graphical environment.
* rocplot: PNG output mode will attempt to use headless mode to prevent an error when the graphical environment is unavailable.
* popsim: Speed improvements.
* readsim/cgsim: Added the --sam-rg flag to set the read group information to be stored in the output SDF. Removed --diploid-input as the recommended way to simulate diploid genomes is to use samplereplay or the --output-sdf option of samplesim/childsim/denovosim.
* readsimeval: New command for evaluating the accuracy of mapping reads generated by readsim.
* pedfilter: New command for pedigree file filtering and simple manipulation and conversion between pedigree PED files and pedigree-augmented VCF headers.
* pedstats: New command for extracting information and summarizing information contained in a pedigree file.
* aview: The flag --dont-display-dots has been renamed to --no-dots for consistency.
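
The bgzip --no-terminate note above relies on a property of the gzip format: independently compressed members can be concatenated byte-for-byte and still decompress as one stream. The following is a minimal Python sketch of that idea, not RTG's implementation. It uses ordinary gzip members rather than true BGZF blocks, the helper names `compress_part` and `concatenate` are hypothetical, and the 28-byte end-of-file marker is the one described in the SAM/BAM format specification.

```python
import gzip

# Assumption: the standard 28-byte BGZF end-of-file marker from the SAM/BAM
# specification (an empty gzip member carrying a "BC" extra field).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"
    "42430200" "1b00" "0300" "0000000000000000"
)


def compress_part(data: bytes, level: int = 6) -> bytes:
    """Compress one chunk as a standalone gzip member, with no EOF marker.

    This mirrors the idea of --no-terminate; `level` (1 fastest, 9 smallest)
    plays the role of --compression-level.
    """
    return gzip.compress(data, compresslevel=level)


def concatenate(parts: list) -> bytes:
    """Join independently compressed parts and terminate the result once."""
    return b"".join(parts) + BGZF_EOF


# Hypothetical example data: a VCF header and one record, compressed separately.
header = b"##fileformat=VCFv4.1\n"
body = b"chr1\t100\t.\tA\tG\t.\tPASS\t.\n"

combined = concatenate([compress_part(header, level=9), compress_part(body, level=1)])
restored = gzip.decompress(combined)
```

Here `restored` equals `header + body`, which is why only the final file in a concatenation needs the termination block.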

RTG Core 3.3.2 (2014-04-09)
---------------------------

This is a bugfix only release:

* Fix soft-clipping behaviour when using the table-based single-indel aligner.

RTG Core 3.3.1 (2013-12-06)
---------------------------

This is a bugfix only release:

* During variant calling with pedigrees, particularly complex situations were deferring to a new algorithm, but this had undesirable performance characteristics on very large pedigrees. This has been reverted until the performance can be improved.

RTG Core 3.3 (2013-11-29)
-------------------------

Major features of this release:

* Speed improvements to family calling and population calling, particularly with large numbers of samples.
* Speed improvements to mapping as a result of a new table aligner (enabled for Illumina data by default).
* Mapping and variant calling have been improved to allow variant calling out to 50bp indels by default, in comparison to previous releases that defaulted to 9bp. When using the new table aligner, there is a net improvement in mapping speed. With the general aligner, mapping speed is impacted, but a full trade-off can be achieved via the aligner band width flag (see below).
* Pipeline streamlining. SAM read group information can now be stored within an SDF at formatting time, and this information will automatically be used by subsequent mapping commands. This has necessitated an increase in the SDF version, so old versions of RTG will not be able to read SDFs created by this version. When variant calling exome datasets, the target region bed file can be supplied to automatically flag variants off target, saving an extra vcffilter step.
* Mapping and variant calling are now PAR aware. If your reference SDF contains information about PAR regions (as described in the user manual), mapping will occur to only one instance of the PAR region, and variant calling will automatically switch between haploid/diploid appropriately.

Detailed changes are listed below by area.
Please read these through fully, as some command-line flags have changed, so updates to your pipeline scripts may be required.

### Basic Formatting and Mapping

* cg2sdf: New flag --sam-rg allows the specification of a SAM read group to be stored in the resulting SDF. Note that this means the SDF version has changed, so SDFs produced by this version of RTG will not be readable with earlier versions of RTG.
* format: New flag --sam-rg allows the specification of a SAM read group to be stored in the resulting SDF. Note that this means the SDF version has changed, so SDFs produced by this version of RTG will not be readable with earlier versions of RTG.
* format: When formatting reads from BAM files, the read group information is automatically stored in the resulting SDF. Only a single read group is permitted per SDF, so if the input contains multiple read groups you must either use the new flag --select-read-group to select only records belonging to the specified read group, or use --sam-rg to explicitly define a single read group that all records will be assigned to.
* format/map: Various improvements to handling of input reads stored in BAM files. When reading input from BAM, records that have the "secondary alignment" SAM flag set are ignored (on the assumption that every read should have a single primary record). Warnings will be produced if the same read has multiple primary alignments or if paired-end data does not have matching records for each read-end, along with a summary after formatting indicating how many cases were encountered.
* map: When mapping from an SDF that contains a read group there is no need to specify --sam-rg, as it is picked up automatically.
* map/mapf: Reduced memory usage during mapping when mapping reads from SDF along with the --read-names option.
* map/mapf: Added new flag --unknowns-penalty to allow more explicit control over how Ns are scored during alignment. The default value is 5 (in comparison, the default mismatch penalty is 9).
* map/mapf: Removed --penalize-unknowns as this is now redundant due to --unknowns-penalty. If desired, equivalent behaviour can be obtained by supplying --unknowns-penalty with the same penalty as the mismatch penalty, or --unknowns-penalty=0 to not penalize unknowns. Note that regardless of the penalty, alignment CIGARs always indicate Ns as a mismatch.
* map/mapf: These commands have the ability to use a new aligner for faster alignment and better identification of longer indels. Setting --aligner-mode=table explicitly enables the use of this aligner, and setting --aligner-mode=general explicitly uses the same aligner as previous versions of RTG.
* map: When mapping Illumina data (as determined by the PLATFORM field of the SAM read group supplied), and the --aligner-mode is set to its default value of "auto", the new table aligner is employed.
* map/mapf: The mechanism for setting aligner band width has changed. The flag --aligner-band-width-factor has been replaced by the new flag --aligner-band-width, which takes the length of indel that can reliably be detected as a fraction of the read length. The new default is 0.5, so for 100bp reads the aligners will attempt to find 50bp long insertions/deletions (there may be some cases where longer events are found). Increasing this fraction will increase runtime, and decreasing it will reduce runtime. Roughly comparable behaviour and speed to the previous release can be obtained with --aligner-band-width=0.1. When changing the --aligner-band-width, it often makes sense to also adjust alignment score thresholds.
* map/cgmap: When performing sex-aware mapping, reads will only be mapped to one occurrence of PAR regions (that on the X chromosome). This requires that the reference.txt in your reference SDF contains specifications of the PAR regions (see the User Manual for the description of the reference.txt file).
* sam2bam: Specifying '-' as the output name will send output to standard out.
* sammerge: Specifying '-' as the output name will send output to standard out.
* sammerge: New flags --require-flags/--filter-flags that allow accepting or rejecting SAM records based on the settings of the FLAGS column of each record. For example, to reject all records marked as secondary alignments, use --filter-flags 256.
* sammerge: New flag --exclude-unplaced to filter out any alignment records that do not have an alignment position.
* samstats: Removed --penalize-unknowns, as this tool could not handle the variable penalties for Ns that have been introduced to mapping.

### Variant Calling

* all variant callers: Calls now include the GL field for increased compatibility with downstream tools such as Beagle and PolyMutt.
* all variant callers: New VCF FORMAT fields are included to aid AVR scoring: PPB detects whether an imbalance of properly paired reads is present, and PUR measures the ratio of placed unmapped reads (those where the mate has been uniquely mapped but the read itself was not mapped).
* all variant callers: These callers now perform PAR aware variant calling by default (males will have PARs on X called as diploid, and on Y not called). There can be some edge effects up to a read length either side of a PAR boundary when complex variant calls are made spanning the boundary.
* all variant callers: For some complex calls the AR FORMAT annotation value would sometimes exceed the maximum value of 1.
* all variant callers: Calibration based calculation of average coverage (used for AVR predictor attributes and overcoverage level setting) is corrected for Ns in the reference.
* all variant callers: New --filter-bed flag for exome region filtering.
* all variant callers: Chromosomes such as MT that are denoted in the reference.txt as "polyploid" are treated as haploid during variant calling. In previous releases only the snp command would call these chromosomes (the family and population commands would skip calling on these chromosomes).
In this release the family and population commands call these chromosomes as haploid and, when a pedigree is present, maternal inheritance is assumed.
* population: New flag --pedigree-connectivity to give explicit control over whether calling is carried out for highly-connected vs sparsely connected pedigrees. The default is to automatically choose the mode. See the user manual for more information on when different modes might be appropriate.
* population: Improved memory usage and startup time when running very large population runs.
* vcffilter: The --density-window flag was not correctly handling indels at the same coordinate as a SNP.
* vcffilter: Reinstated flag --remove-all-same-as-ref (which was incorrectly removed previously).
* vcffilter: New flags --max-combined-read-depth and --min-combined-read-depth for filtering on the combined read depth in the INFO column.
* vcffilter: New flag --clear-failed-samples for clearing the GT of failed samples instead of removing the entire variant line.
* vcffilter/vcfannotate: Specifying '-' as the input name will read from standard in, and similarly specifying '-' as the output name will send output to standard out.
* vcfmerge: Specifying '-' as the output name will send output to standard out.
* vcfstats: Fixed an exception when running on VCF files containing records with no GT field.
* vcfeval: New beta module for evaluating variant call sets (formerly known as snpsimeval).
* vcfeval: Now issues warnings when there are differences between the set of sequences contained in the reference vs baseline variant set vs called variant sets.
* vcfeval: Previously, extremely complicated situations could consume vast amounts of memory and eventually crash. These now exit gracefully with a message about which region caused the problem. It is currently up to the user to manually filter out the problematic region and rerun.
* vcfeval: The --sort-value parameter has been removed in favor of using the --vcf-score-field flag.
* rocplot: This is a new module for plotting ROC graphs, which can run either as an interactive GUI or generate a static PNG image.

### Other

* extract: New flag --header-only to output only the header for the file of interest.
* species: Includes genome length column (this column will only contain a value when the row corresponds directly to a species rather than an inner taxonomy node, i.e. when the value in the reference column = Y).
* readsim: Allows simulation of PCR duplicates and chimeric reads.
* aview: Now has the ability to load BED tracks (e.g. particularly useful to display variant caller regions.bed.gz, or an exome target regions BED file).
* aview: When filenames for VCF and BED tracks are specified in the form FILE=TITLE, the title given will be used in the track display.

Previous releases:

RTG Core 3.2.2 (2013-09-17)
---------------------------

This is a bugfix only release:

* many: Commands that involved reading from multiple SAM files could produce a crash in some circumstances when mixing single-end and paired-end records.
* vcfsubset: Sample presence checking was over-stringent, by checking for SAMPLE header lines as well as a named sample column on the CHROM line. SAMPLE header lines are now not required.
* coverage: Fixed an exception that would occur when running with a smoothing parameter larger than 5000.

RTG Core 3.2.1 (2013-08-01)
---------------------------

This is a bugfix only release:

* coverage/species: Fixed a crash when trying to generate graphs for reports when running on a machine without X11 available.
* many: I/O exceptions raised during asynchronous file reading could sometimes cause a talkback instead of a graceful user error message, or rarely would fail to detect the exception altogether.
* vcffilter: Even when not specified, AR filtering was removing values greater than 1 if other per-sample filtering was being carried out.
(These AR values greater than 1 are very rare and will be addressed in a subsequent release.)
* vcfsubset: Up-front checking of field names supplied by the user (FILTER/INFO/FORMAT) instead of causing an exception during later processing.
* vcfsubset: When the stripping of specified FORMAT subfields would result in a variant having no format sub-fields remaining, rather than outputting an invalid record, the whole record is removed (and a count of such records is output upon completion).

RTG Core 3.2 (2013-06-20)
-------------------------

IMPORTANT NOTES WHEN UPGRADING:

* As requested by many customers, the default alignment output by map commands is now a single BAM file, named 'alignments.bam', to further simplify subsequent processing without the need for an explicit merge. Any scripts you may have will need to be updated before working with this new version. The old behaviour can be obtained using the flags --sam and/or --no-merge. See the user manual for more information on these flags.
* The pre-built AVR models have been revised and rebuilt, and now include separate models for exome and WGS datasets. As such, the models will have somewhat different characteristics compared to their analogs from 3.1. Most notably the scores are likely to be spread over a wider range than before, so any cut off thresholds currently in use for filtering should be adjusted. The three pre-built models are now 'illumina-exome.avr', 'illumina-wgs.avr', and 'alternate.avr'.

Other major features of this release:

* Population priors can now be supplied in a more compact form, allowing for much faster processing when using population priors derived from a large number of samples.
* Significant speed improvements to family calling and population calling (particularly pedigrees involving many families).
* Many improvements to avrbuild for customers building their own models. These include greatly reduced memory requirements and improved speed.
The models are also now self balancing, so discrepancies between the amount of positive and negative training data should have less impact on the spread of scores produced.

Detailed changes are listed below by area. Please read these through fully, as some command-line flags have changed, so updates to your pipeline scripts may be required.

### Basic Mapping

* map/cgmap/mapf: The default outputs of these commands have been changed. Rather than outputting separate block-compressed SAM files for each of mated / unmated / unmapped reads, these commands now output a single merged alignments BAM file. The implementation of this incurs a much smaller overhead than performing a separate post-mapping merge. Two flags have been added that can be used to obtain the old behaviour: --sam will output SAM rather than BAM, and --no-merge will cause separate output files to be produced.

### Variant Calling

* snp/family/population: The ambiguity ratio VCF annotation was not being output for complex variant calls.
* snp/family/population: The representation of indels in output VCF now only includes the previous base when absolutely necessary (previously the previous base was included for all indels, in accordance with an earlier version of the VCF specification).
* snp/family/population: VCF records now include an annotation containing the allelic depth (AD) for each sample.
* snp/family/population: Calling with population priors (via --population-priors) will now extract allele counts from the INFO AC/AN fields if these are present, and only fall back to counting from per-sample GT fields if AC/AN are not present. A population priors VCF that contains only these INFO fields and no sample columns can be significantly faster to process. Such a reduced VCF can be constructed by passing the original population priors VCF through rtg vcfannotate --fill-an-ac, followed by rtg vcfsubset --keep-info AC --keep-info AN --remove-samples. See the user manual for more information.
* snp/family/population: The ABP/SBP annotations for complex called variants were not being calculated correctly in all cases.
* family/population: Much faster computation of pedigree posteriors (for some cases involving many families the entire variant calling run is 5x faster).
* population: Complex call splitting was sometimes not correctly representing any disagreeing hypotheses (DH attribute). This has now been fixed.
* vcffilter: The --sample flag can be specified multiple times to require that any sample-specific filtering criteria apply to more than one sample.
* vcffilter: New flag --all-samples can be used to apply any sample-specific criteria to every sample in the VCF.
* vcffilter: New flag --non-snps-only.
* vcffilter: Renamed flag --snp-density-window to --density-window, since it acts on all variants, not just SNPs.
* vcffilter: Removed flag --remove-all-same-as-ref, as this can now be achieved with --remove-same-as-ref --all-samples.
* vcffilter: New flags --min-denovo-score and --max-denovo-score help identify high quality de novo variants. See the user manual for more information.
* vcfannotate: Restructuring of the flags accepted by vcfannotate. To annotate with variant IDs contained in a BED file, use --bed-ids BEDFILE. To annotate from a BED file into the INFO field of the output VCF, use --bed-info. To annotate with variant IDs contained in a VCF, use --vcf-ids VCFFILE.
* vcfannotate: New option --fill-an-ac to recompute AN and AC INFO fields based on GT fields present in the VCF.
* vcfsubset: New command for performing column-wise removals from a VCF file, such as removing samples, format sub-fields, info fields, etc.

RTG Core 3.1.2 (2013-05-02)
---------------------------

Changes in this release:

* vcffilter: Fixed a bug when filtering multi-sample VCF files by --min-avr-score or --max-avr-score. The code was ignoring the --sample flag and always basing the filtering on the AVR score of the first sample.
* family: Added a new flag --pedigree to allow specifying the family in PED format instead of via --father/--mother/--son flags etc. The PED file must contain only a single nuclear family.

RTG Core 3.1.1 (2013-04-24)
---------------------------

This is a bugfix only release:

* map/cgmap: When using --bed-regions during mapping of exome data, information for sequences not referenced in the BED was not correctly initialized, leading to an error message in subsequent variant calling.
* avrbuild: A small improvement in the treatment of missing values during model building.
* population: Fix a problem with internal ids associated with families.

RTG Core 3.1 (2013-04-19)
-------------------------

Major features of this release:

* Improvements to alignments produced during mapping. The aligner penalties have changed to give better behaviour regarding indel detection. There are several additional user controls to give finer control over aligner behaviour.
* De novo variant calling in pedigree calling. Variants in offspring will automatically be marked in the output, and an additional score is produced that indicates the confidence that the variant is a true de novo variant.
* Adaptive Variant Rescoring (AVR) is a new capability that allows a scoring of variants that incorporates effects such as library prep or mapping artifacts that are not directly incorporated in the Bayesian variant modelling. AVR incorporates machine learning models built from predictor attributes that empirically correlate with correctness with respect to a base variant set. Several new attributes have been added to output VCFs to facilitate this. This release includes some pre-built scoring models and provides tools to allow building models that may be better adapted to particular projects.
* Streamlined processing of exome datasets.
Mapping now directly supports the specification of an exome regions BED file to ensure variant calling includes appropriate calibration information for automatically determining over coverage situations.

Detailed changes are listed below by area. Please read these through fully, as some command-line flags have changed, so updates to your pipeline scripts may be required.

### Basic Mapping

* format/map/mapf/mapx: Input FASTQ that is badly formed due to a mismatch between the sequence names given in the "@" vs the "+" sections of the record used to be silently skipped. These records are now processed but warnings are issued that the FASTQ is badly formed.
* map: New flag --bed-regions that should be used during exome processing to ensure correct calibration information is available during variant calling. See the user manual for more information on exome processing workflows.
* map/mapf/cgmap: Several flags have their long names renamed to improve consistency with SAM specification terminology and for updated semantics in the presence of altered aligner penalties:
  * --max-insert-size renamed to --max-fragment-size
  * --min-insert-size renamed to --min-fragment-size
  * --max-alignment-score renamed to --max-mismatches
  * --max-mated-score renamed to --max-mated-mismatches
  * --max-unmated-score renamed to --max-unmated-mismatches
* map/mapf/cgmap: The various --max-*-mismatches flags have a slightly different interpretation with new aligner penalties. This now sets an alignment score threshold that would allow the specified number of mismatches.
* map/mapf: These commands have new flags that give more control over aligner penalties and indel detection capabilities:
  * --mismatch-penalty. The penalty used when scoring a mismatch.
  * --gap-open-penalty. The penalty used when scoring the opening of a gap.
  * --gap-extend-penalty. The penalty used when scoring the extension of a gap.
  * --soft-clip-distance.
    When using aligner penalties that favour detection of indels, often incorrect indels will be produced at the ends of reads. This flag specifies the distance from the ends of reads within which any indels will be soft-clipped.
  * --aligner-band-width-factor. Increasing this factor will allow longer indels to be aligned, at the expense of speed. A factor of 1 gives room to find indels with length corresponding to --max-mismatches. A factor of 2 will double this length.
* map/mapf: The default aligner penalties have been updated to give better handling of indels in alignments. To achieve scoring similar to previous versions, use --soft-clip-distance=0 --mismatch-penalty=1 --gap-open-penalty=1 --gap-extend-penalty=1.
* coverage: Speed improvement when processing references containing large numbers of sequences.
* coverage: Automatically outputs an HTML report that graphically shows aggregate coverage levels.
* coverage: Fixed handling of zero length reference sequences.

### Variant Calling

* family/population: These commands automatically perform detection of de novo variants in children. This evaluation is represented by two new FORMAT attributes in the VCF. DN is a binary indicator as to de novo status, which will appear for any child samples in which one of the siblings contains a putative de novo variant (with value 'Y' for the putative de novo), and DNP contains the de novo posterior score.
* family/population: These commands implement phasing by descent on a site-by-site basis. Wherever it can be unambiguously determined which parent contributed each allele, the child GT is phased so that the paternal allele is first, maternal allele second.
* population: Fixed a rare corner case where overlapping calls could be output when multiple families were present in the pedigree.
* snp/family/population: These commands no longer hard filter on ambiguity ratio by default, the motivation being that this functionality should be subsumed by AVR.
* snp/family/population: These commands automatically output a simple HTML report containing summary information on variant calls.
* snp/family/population: There have been many changes to the annotations produced by default, primarily to provide predictor attributes for use with AVR. See the user manual for the full descriptions of these new annotations.
* avrbuild: New command to build a machine learning model of variant accuracy from annotated training data. See the user manual for more details on AVR model building.
* avrpredict: New command to score variants using a machine learning model created by avrbuild.
* snp/family/population: These commands have a new flag --avr-model to supply a machine learning model that will be used to score variants during calling. This release includes two pre-built models, and users may build their own using avrbuild.
* avrstats: New command to display simple information about an AVR model.
* mendelian: New flag --output to output all input variants (with mendelian violation information added as annotations) to a single output file.
* snpfilter: This previously beta command is now renamed to vcffilter.
* snpannotate: Renamed this beta command to vcfannotate.
* vcffilter: New flags --include-vcf and --exclude-vcf to include or exclude variants that overlap with those contained in the specified VCF file. Removed --allele-balance-variation flag (the allele balance annotation it operated on has been removed, superseded by a similar annotation capturing allele balance information).
* vcffilter: New flags --max-avr-score and --min-avr-score to allow filtering on the AVR annotations.
* vcffilter: New flag --remove-overlapping to discard any variants that overlap with previous variants.
* vcfmerge: When merging records at identical locations that contained different ALT alleles, any FORMAT attributes that used VCF number entries "A" or "G" (e.g. GL, PL, GP fields produced by other variant callers) would become invalid.
vcfmerge will now leave those records unmerged in the resulting VCF if any A or G number attributes are present.
* snpsimeval: New beta module for performing very detailed concordance analysis of two variant call sets, capable of resolving differences in representation of equivalent calls.

### Other

* many: Fixed confusing reporting of out-of-disk space errors in multi-threaded situations.
* aview: New beta module for creating visual pileups in a terminal or HTML form.

RTG Core 3.0 (2013-02-15)
-------------------------

Below are the major changes in this release. Please read these through fully, as some command-line flags and output file names have changed, so updates to your pipeline scripts may be required.

### Metagenomics

* mapf: Added the flag --sam-rg. This allows you to specify the platform from which the reads originate (in particular, when the platform is IONTORRENT slightly different alignment parameters are used).
* similarity: This command now computes a principal component analysis on the similarity matrix and outputs this to a file named similarity.pca in the output directory. This analysis is to better allow sample clustering.
* species: Now calculates several species diversity metrics (species richness, Shannon diversity index, Pielou evenness index, and the inverse Simpson index) and outputs these to the summary.txt file.
* species: Computes upper and lower bounds for abundance estimates, along with a confidence value for each species. See the user manual for more information.
* species: Incorporates taxonomic information to allow clade-level abundance estimation. This requires the metagenomics reference database to contain taxonomic information. The easiest way is to obtain a metagenomics reference database from Real Time Genomics. See the user manual for more information.
* species: Removed the upper limit of 400 on the number of input SAM files.
* species: Removed --iterations flag, the termination criteria is now determined automatically.
* species: Added --min-confidence flag to specify the minimum confidence value for which a species/clade is reported in the output.
* species: Added --threads flag for setting number of threads. Species will now utilize multiple cores when possible, although for some input datasets there may be large portions of the run that only use a single core.
* species: Now automatically produces an HTML formatted report that allows interactive visualization of abundances (this feature is based on the Krona visualization tool).
* sdfstats: Added --taxonomy flag to allow outputting basic taxonomy information about a metagenomics reference SDF. This includes statistics such as the number of taxon nodes, the number of nodes with sequences, and the number of other nodes.
* Added three pipeline commands (composition-meta-pipeline, functional-meta-pipeline, composition-functional-meta-pipeline) to simplify the common use-cases of performing abundance and functional analysis starting directly from reads. These commands internally call several RTG commands in sequence and output an HTML report containing summary information and links to primary output files.

### Basic mapping

* map/cgmap: These commands now automatically output an HTML report containing mapping summary statistics and graphs useful for QC.
* map/cgmap/mapf: Speed improvement (~5%) when mapping reads stored in an SDF due to more efficient SDF loading.
* mapping: Bugfixes to aligners to better handle some edge cases (heuristic aligners would occasionally prefer an alignment with a higher score).
* calibrate: Added a new flag --bed-regions to restrict the calibration calculations to the regions in the provided bed file. This option should be used to calibrate mappings of exome data in order for the calibration files to contain the correct depth of coverage information for supplying to variant calling.
(Upcoming releases will allow the regions file to be supplied directly during mapping, avoiding the need for exome processing to have this extra step.)
* coverage: Added a new flag --bed-regions to restrict the coverage calculations to the regions in the provided bed file. This allows coverage to be reported both per-exome region, and across all exome regions.
* coverage: Some of the contents of the summary file have moved to separate files. stats.tsv contains per-reference-sequence (or per region for exome analysis) coverage statistics. levels.tsv contains breakdowns of the proportion of the reference genome (or exome when appropriate) at the various coverage levels.

### Variant calling

* snp/family/population/somatic: The regions.bed file describing the types of processing occurring across the genome contains some new categories, and the values used in the name column of the bed file have been improved. See the user manual for more information.
* snp/family/population/somatic: The GQ values produced for each sample have been improved.
* snp: Removed a case where variant calls failing ambiguity filters could cause overlapping calls to be output.
* snpfilter: This tool can output to stdout when '-' is used as the output file name.
* mendelian: Added the ability to read VCF directly from stdin when '-' is used as the input file name.
* vcfstats: Added the ability to read VCF directly from stdin when '-' is used as the input file name.
* vcfstats: --known and --novel flags added for calculating stats for known only or novel only variants, as determined by whether the VCF record has an identifier set (i.e. column 3 of the record).

### Other

* many: We have added the ability to log the usage of RTG commands to a server. Depending on your license, it may be a requirement to have this enabled. See the user manual for more information.
* many: For consistency, several output file extensions containing tab-delimited data have been changed from txt to tsv where appropriate. This better corresponds to the file contents and allows appropriate "click-to-open" behavior for viewing tabular results in a spreadsheet application. Full listing below:
  * mapx: Renamed primary output files (alignments.txt -> alignments.tsv and unmapped.txt -> unmapped.tsv)
  * map/cgmap: Read group statistics output file (rgstats.txt -> rgstats.tsv)
  * sv: Primary output files (sv_simple.txt -> sv_simple.tsv and sv_bayesian.txt -> sv_bayesian.tsv)
  * assemble: The graph output directories now use .tsv files instead of .txt files for the Path.N and header output files, but can still read the existing graph output directories.
  * snpsimeval: All ROC output files now use .tsv extension.
* many: When specifying a --region using name:start+length notation, an extra base was being included.
* samrename: Added flags --no-gzip and --no-index for consistency with similar commands.
* samrename: Added flags --end-read and --start-read to reduce the memory requirements when renaming mapping outputs that also had been run on a subset of reads.
* readsim: When simulating a metagenomic sample, this beta tool now outputs a file containing the generated fractions for each input sequence.
* popsim: New beta tool to generate a VCF containing simulated population-level variants.
* samplesim: New beta tool to generate a simulated pedigree-free member of a population using variants defined in a VCF (such as that created by popsim).
* denovosim: New beta tool to simulate de novo variants within a genome.
* childsim: New beta tool to generate a simulated genotype as offspring of existing parent genotypes defined in a VCF.
* samplereplay: New beta tool to replay variants defined in a VCF into a reference genome, to be used with readsim.
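
The --region fix noted above (an extra base being included with name:start+length notation) is a classic off-by-one. The Python sketch below shows the intended arithmetic only. It assumes 1-based inclusive coordinates, which should be checked against the user manual, and `parse_region` is a hypothetical helper, not RTG's own parser.

```python
import re

# Matches "name:start+length", e.g. "chr1:1000+50". Illustrative only.
_REGION = re.compile(r"^(?P<name>.+):(?P<start>\d+)\+(?P<length>\d+)$")


def parse_region(spec):
    """Return (name, start, end) with a 1-based inclusive end (assumed convention)."""
    match = _REGION.match(spec)
    if match is None:
        raise ValueError(f"not in name:start+length form: {spec!r}")
    start = int(match.group("start"))
    length = int(match.group("length"))
    if start < 1 or length < 1:
        raise ValueError(f"start and length must be positive: {spec!r}")
    # A region of `length` bases beginning at `start` ends at start + length - 1.
    # Using start + length as an inclusive end would cover one base too many.
    end = start + length - 1
    return match.group("name"), start, end


name, start, end = parse_region("chr1:1000+50")
print(name, start, end, end - start + 1)
```

With this convention the span `end - start + 1` always equals the requested length, which is a quick check to apply to any region produced this way.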

RTG Investigator 2.7.5 (2013-02-08)
-----------------------------------

This is a bugfix only release:

* snp/family/population: Fix an infinite loop that could occur near the start of variant calling involving multiple read groups.

RTG Investigator 2.7.4 (2013-01-11)
-----------------------------------

This is a bugfix only release:

* many: SAM read group platform name checks are now case-insensitive.
* map: When mapping input reads contained in SAM files, the input no longer needs to be sorted by read name.

RTG Investigator 2.7.3 (2012-11-23)
-----------------------------------

This is a bugfix only release:

* snp/family/population: Fixed an exception if the VCF supplied as population priors contained a variant at the first base of a chromosome.
* vcfmerge: This tool used to attempt to merge variants at the same reference position where the length of reference spanned differed between variants, by adding padding reference bases to the shorter variant. This is misleading, as it assumes the subsequent bases are non-variant. These variants are no longer merged.
* vcfmerge: Now outputs warnings when overlapping variants are encountered within a sample.
* snp/family/population: Fixed an exception that could occur when processing mappings with very long (>1Kb) indels/skips within an alignment.

RTG Investigator 2.7.2 (2012-10-23)
-----------------------------------

This is a bugfix only release:

* population: Fixed another corner case that could arise with the disagreeing hypotheses.

RTG Investigator 2.7.1 (2012-10-19)
-----------------------------------

This is a bugfix only release:

* population: The disagreeing hypothesis attribute (DH) was not formatted correctly, and could contain a ':' character which made the resulting VCF invalid. The first element of the DH attribute now uses GT-style notation (i.e. expressed using allele IDs rather than the allele bases themselves).
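
The vcfmerge change in 2.7.3 above is easier to see with a concrete case. The Python sketch below illustrates the padding approach that was dropped. It is a hypothetical illustration, not RTG code: `pad_to_reference` and the example alleles are invented for this note.

```python
def pad_to_reference(ref, alts, longer_ref):
    """Extend a record's REF and ALT alleles so they span `longer_ref`.

    This mimics padding a shorter variant with trailing reference bases so
    that it can share a REF with a longer variant at the same position.
    """
    if not longer_ref.startswith(ref):
        raise ValueError("records do not share a reference prefix")
    suffix = longer_ref[len(ref):]
    return longer_ref, [alt + suffix for alt in alts]


# Record 1: a SNP A>G. Record 2 at the same position: a deletion ATG>A.
padded_ref, padded_alts = pad_to_reference("A", ["G"], "ATG")
print(padded_ref, padded_alts)
```

After padding, the SNP is written as ATG>GTG, which asserts that the two bases after the SNP match the reference. The original record made no such claim, which is why such records are now left unmerged.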
RTG Investigator 2.7 (2012-09-22) --------------------------------- Changes in this release: ### Basic mapping * cgmap: Removed the --read-names flag, as the Complete Genomics raw reads files do not contain read names. * map/cgmap: Speed improvements when mapping very large read datasets. * map/cgmap: The functionality of svprep has been integrated into mapping, to assist with more streamlined indel and structural variant discovery. This stage can be disabled by using the flag --no-svprep. * map/cgmap/mapx: These mapping tools now produce unmapped reads output by default. The flags -U/--report-unmapped from previous versions have been replaced by a new flag --no-unmapped to disable output of unmapped reads. * coverage: The reference SDF is now optional. If not supplied, non-n coverage statistics will not be computed. ### Metagenomics * species: Added the ability to supply unmapped reads SAM files, in which case it will compute a new statistic, which is the percentage of the entire sample that each species represents. See the user manual for more information. * species: The species module requires coordinate-sorted mappings but was not checking whether input mappings were in fact coordinate sorted. This is now fixed. ### Variant calling * all callers: Fixed an exception that could occur on short reference sequences containing few calls. * all callers: Reduced memory usage, particularly when using --all mode. * all callers: Improved speed when calling on very high coverage single-end data. * all callers: The output VCF now includes a SAMPLE header line that indicates the sex of the sample when known (this is useful for downstream tools such as rtg mendelian). * all callers: Population priors that contained a large number of genomes (e.g. 1000 genomes VCF) would use much more memory than required. Such very large files will still be very slow to parse though, and may even dominate variant caller runtime. An upcoming release will address this.
* all callers: Population priors are now also used to inform complex calling. * all callers: The behaviour of the --max-coverage and --max-coverage-multiplier flags has changed. Previously these adjusted both built-in thresholds used to skip complex calling in areas of very high coverage, and coverage filters applied to variants at output time. These are now used only to control the maximum depth of coverage (across all samples) for which calling is made. * all callers: Output filtering on depth of coverage can still be achieved by using the new --filter-depth and --filter-depth-multiplier flags. The default is to not apply any coverage filtering; however, if you are running datasets with read lengths much less than 100 it may be beneficial to set this to approximately 3 times the average coverage or apply other post-call filtering. See the user manual for more information. * all callers: The --max-ambiguity flag has been renamed to --filter-ambiguity to clarify that it is a filter applied at the output level. * all callers: Fix an exception when input mappings contain QUAL scores higher than 63. * population: Improved error handling when parsing PED files. * population: Hypothesis pruning is now turned on by default; previously the caller could experience performance problems in areas of extremely high coverage due to too many candidate hypotheses. * population: Now utilizes family calling code when relationships in the input pedigree file indicate families are present. If you wish to disable relationship information during calling, simply supply a pedigree file that contains missing values (0) in the paternal id / maternal id columns. In cases where a sample is a member of multiple families (e.g. parent in one family and child in another), but the calls from each family disagree, these calls are annotated with the DH attribute (these are good candidates for de novo mutations).
* population: the embedded family calling code also supports partial families where one or more members are referenced in the pedigree file but have not been sequenced or mapped. By default, calls will only be produced for samples for which input mappings have been provided. There is a new option --impute to output the imputed genotype for a family member that has not been sequenced/mapped. The accuracy of these calls will be better for parents in families with many children. * vcfstats: New flag --sample to output variant statistics for only the specified sample (the default behaviour is to calculate statistics for every sample in the VCF file). * vcfstats: New flag --allele-lengths to output a histogram of variant lengths, broken out by variant type. ### Other * all variant callers/coverage/cnv/sammerge/readsimeval: These modules contain a new flag, --min-mapq that allows filtering the input SAM to ignore all SAM records with MAPQ lower than the given value. * all: Changed the wording in progress files final entry for failed runs to make failures more obvious and for easier script processing ("unsuccessful" -> "failed"). * mendelian: When outputting records containing Mendelian errors to a VCF file, the flag has been renamed from --output to --output-inconsistent. There is a new flag --output-consistent to output those records that do not contain mendelian errors. * mendelian: The summary percentage statistics take the total records as only those for which some family members contain alternative alleles. * mendelian: The --male and --female flags need not be supplied when checking a VCF file that contains sex information in the SAMPLE headers. * mendelian: Family pedigree information can now alternatively be supplied via .PED pedigree file. * snpfilter: New flag --remove-all-same-as-ref to remove variants where all samples are non-variant. * bgzip: This command now accepts multiple file names for zipping/unzipping. 
* snpsimeval: The flag -m/--mutations has been renamed to -b/--baseline. RTG Investigator 2.6 (2012-05-22) --------------------------------- Major features of this release: * Improvements to mapping speed. Many of the components used during mapping have been examined, improved, and multi-threaded, in particular calibration, tabix index creation, handling of mapping direct from FASTA/FASTQ/SAM/BAM. For our typical HiSeq mapping runs, elapsed time for a mapping run is approximately 30% faster. * Reductions in memory use during mapping. These contribute to the speed improvements above but also mean that more reads can be mapped in a single run on a given compute node. * Reductions in temporary disk footprint during mapping. Many temporary files encode their information more efficiently and are cleaned up as soon as they are no longer needed, giving approximately 50% reduction in temporary disk usage. * Significant speed and memory improvements to the coverage module, in some cases orders of magnitude faster and using less memory. * New module for performing population aware variant calling on multiple samples simultaneously. This can provide a significant improvement in accuracy over the single-genome variant caller, particularly when the per-sample mappings are relatively low coverage (below 15x). Changes by command: * format: Better error checking and improved output when formatting FASTA data. * sdfsubseq: The region is now supplied as an anonymous flag (i.e. the "--region" or "-s" are no longer needed), and it is possible to supply multiple regions. This makes it easy to extract a FASTA file containing multiple regions extracted from an SDF. * many: Tabix indexing of SAM/BAM output files is significantly faster. * map/cgmap/calibrate: Calibration has been sped up significantly. * map/mapf: Reduced memory usage during mapping, particularly when mapping direct from FASTA/FASTQ or using --read-names.
* map/cgmap: Reduced disk requirements for intermediate files during mapping by implementing more efficient temporary files and deleting temporary files as soon as they are no longer needed. * map/cgmap: Multithreaded the tabix indexing of output SAM/BAM files. * map/cgmap/mapf: Fix exception when mapping very large read sets with quality data. * map/cgmap/mapf: Fix exception when determining percentage based repeat frequency cutoff for very large read sets. * map/samrename: Illumina paired-end reads with /1, /2 arm-indicator suffix have these suffixes stripped in the output SAM/BAM QNAME field, as per the SAM specification. * sammerge: Multithreaded handling for the case of multiple input files for improved throughput. Use the new --threads flag to customize the behaviour. * coverage: Significantly (1 to 2 orders of magnitude) faster and more memory-efficient implementation. * coverage: New option --keep-duplicates flag to disable the automatic detection and removal of optical/PCR duplicates during coverage calculation. * coverage: The smoothing calculation now takes into account the reduced size of smoothing window at the edges of reference sequences. * species: More robust parsing of the species relabel file. * species: Now omits from the output any species that are determined to not occur in the sample. A comment line is included in the output that lists the number of omitted species. * population: This is a new command that performs multi-sample population variant calling. See the user manual for more information and examples. * snp/family/population/somatic: AB and AR genotype fields are now only output when the call has covering reads. There are rare cases when calls may be made with no covering reads (primarily when evidence for the call comes from other family or population members) * snp/family/population/somatic: New flag --population-priors allows supplying an input VCF containing variants observed in the population, to be used as priors during calling. 
The input VCF must be tabix indexed. * snp/family/population/somatic: Improved speed during complex calling (some runs are 25% faster). * snp/family/population/somatic: Minor improvements to complex calling trigger conditions, particularly with higher coverage and in the presence of short repeats. * somatic: Fixed an exception in areas of high coverage and high complexity (i.e. when many possible hypotheses are observed). * vcfmerge: New flag --stats to output summary statistics corresponding to the contents of the output VCF. * vcfmerge: The default behaviour is to refuse merging when fields are encountered with the same ID but differing descriptions, as the field semantics may be completely different. A new flag --force-merge allows such headers to be merged on a per field basis. * svprep: This command now updates the unmapped reads SAM file if present to give the expected location of unmapped arms when the other arm is uniquely mapped. This can be used to assist analysis of split-reads and structural variants. * discord: The --max-ambiguity and --max-coverage flags have been removed, as they were of dubious value and impeded development of other functionality. * discord: Calls with predicted breakpoints falling outside the reference sequences have those predicted locations adjusted to comply with the specifications for the relevant output format. RTG Investigator 2.5.2 (2012-05-08) ----------------------------------- Changes in this release: * many: Several data file format problems are now reported as a regular user message rather than causing a talkback. * coverage: This command was flushing far too often, resulting in slower run times than 2.4. This has been fixed. * cgmap: Better validation for input read data containing expected read lengths. * mapx: The maximum word size of 12 is now checked during parameter validation. * vcfmerge: Flag-type INFO values were not being passed through correctly.
* species: Fixed exception when input species contained 0-length genomes. * family: Fixed an exception that could occur when calling on the Y chromosome. RTG Investigator 2.5.1 (2012-03-14) ----------------------------------- Changes in this release: * map/mapf/cgmap: Fixed exception when mapping large (23Gb) input files directly from FASTA/FASTQ. * map/mapf/cgmap: Doubled the limit on the number of reads that can be handled in one mapping run, assuming available memory, (to 2^31 single end reads, 2^30 paired-end reads). * snp/family/somatic: Fix exception when given input mappings with average coverage less than 1 on any reference sequence. * many: Truncated bgzip files are now reported as a regular user message rather than causing a talkback. * map/cgmap: Calibration is slow when you have large numbers of reference sequences. You may use the new --no-calibration flag to disable calibration if you do not require calibration data for subsequent variant calling. RTG Investigator 2.5 (2012-03-06) --------------------------------- Major features of this release: * Multithreaded variant calling. While it was always possible to use the --region command to execute variant calling as separate jobs, the variant callers (snp/somatic/family) are now internally multithreaded for improved throughput. * Improvements to variant calling accuracy. Primarily through better complex calling, automatic duplicate read detection, and improvements to calling near simple repeats. * Improvements to somatic caller accuracy. In particular, the somatic caller now explicitly models loss of heterozygosity events and includes loss of heterozygosity information in VCF records where appropriate. * Improved Ion Torrent support. This includes support for paired-end Ion Torrent data, and more accurate variant calling. * Mapping commands may now directly output BAM as an alternative to block-compressed SAM. * SDFs now store the full description line of sequence names from FASTA/FASTQ input files. 
Where appropriate, commands use the full name in output (e.g. in many cases you won't need to supply a relabel file when running the species command with long species names). The SDF version has been incremented as a result of this change. RTG supports SDF backward compatibility but not forward compatibility, so versions of RTG prior to 2.5 will not be able to read these newer SDFs. * Beta level commands for structural variant detection. Feedback on these commands is particularly welcome. Changes by command: * many: Output SAM/BAM now declare themselves as version 1.3 in the header. * many: The TLEN/ISIZE field of SAM files is now calculated as the "observed template length" (as described in the SAM 1.3 specification) rather than the "distance between 5' ends" (described in the SAM 1.2 spec). * many: More robust parsing of the genome SDF reference.txt file that specifies sex / ploidy. * many: Accept BAM indexes named .bai, not just .bam.bai. * many: Performance improvements to tabix reading. * many: The --no-tabix-index flag has been renamed to --no-index (and applies to both .tbi and .bai index production). * wrapper: The rtg wrapper script was broken if there was a space in the path to your Java executable (this includes the case of using the bundled JRE when there was a space in the path to RTG installation directory). * format: Support for formatting coordinate-sorted SAM/BAM read data. Note however that there may be speed issues when mapping such data though, as often all unmated / unmapped reads will be processed at the same time and these are more intensive to map than reads that can be properly mated. * map/mapf/cgmap/mapx/sdfsplit: Fixed an implementation limitation on the number of bases in a read set that could be processed in a single run. The old limitation was at approximately 45Gnt (per arm for paired-end data), roughly 1.3B CG reads or 450M HiSeq reads, which has been removed. 
Mapping commands still have an implementation limit on the number of reads that can be processed in a single mapping run (2^30 single end reads, 2^29 paired-end reads) which will be addressed in an upcoming release. * map/cgmap: Fixed an exception that occurred on very large datasets (~500M reads) when the proportion of unmated reads was high. * map/mapf/cgmap: Added the ability to output BAM rather than tabixed SAM (use the new flag --bam to enable this). * map/mapf/cgmap: Added the ability to only mate reads that map in a particular orientation, to improve the ability to identify structural variants. For example, typical Illumina reads should have --orientation FR, Complete Genomics reads should have --orientation TANDEM. The default (ANY) does not enforce any particular orientation on mated reads. * map/mapf/cgmap: Mapping against a reference that contains duplicated sequence names could cause an exception. This has been fixed. * sam2bam/sammerge: When merging/converting files that have accompanying calibration files, a merged calibration file will automatically be created. * sammerge: Flag --exclude-duplicates was not working correctly. Fixed. * sammerge: Added the flag --legacy-cigars to convert new-style (X/=) CIGARs to legacy CIGARs in the output. * sam2bam/sammerge/coverage/snp/family/somatic: These commands no longer have a limit on the number of input SAM files. * index: The format "snp" is no longer listed as an option (use "vcf" instead). The format "coverage" has been removed (use either "bed" or "coveragetsv" instead). * coverage: New option --bedgraph to cause output to be written in bedgraph format rather than bed. This allows coverage files to be directly viewed in tools such as IGV. * snp/family/somatic: Fully multithreaded variant calling. * snp/family/somatic: Now implements duplicate read detection. Use --keep-duplicates to disable this behaviour.
* snp/family/somatic: Now automatically loads mapping calibration files corresponding to each input SAM file (they may still be explicitly supplied if desired, for example if you typically move or rename SAM or calibration files after mapping). Use the new flag --no-calibration if you wish to disable calibration. * snp/family/somatic: Improvements to complex calling * snp/family/somatic: Improvements to calling near simple repeats (homopolymer/dinucleotide repeats) * snp/family/somatic: Better error handling if specifying a --region with coordinates that exceed the bounds of the requested chromosome. * snp/family/somatic: New flag --coverage-multiplier to adjust the thresholds used to detect over-coverage calls when calibration information is available. The default coverage multiplier is 2.0 (i.e. variants where coverage is twice the average coverage over the entire sequence will be flagged as over-coverage). * snp/family/somatic: Changed representation of variant calls that fail overcoverage filters (OC in the FILTER column, plus CT in the INFO specifies the actual threshold applied. See the user manual for more information). * snp/family/somatic: Now outputs total depth (DP) field in VCF. * snp/family/somatic: Utilizes RTG Complete Genomics extended cigar attributes (XQ/XR/XU) when present (this primarily affects non-complex calling, as complex calling performs its own probabilistic realignment). * snp/family/somatic: Fix an exception when running variant calling against mappings containing read groups that have no platform specified. * somatic: Now models loss of heterozygosity events and indicates in the VCF whether a variant represents a LOH event. See the user manual for more details. * vcfstats: New command to output summary statistics for a VCF file. * snpfilter: Changed the default behaviour so that if no flags are specified, no filtering is performed. Flags must be explicitly supplied corresponding to the desired filtering. 
* species: Graceful termination in the case of running detection against a single species genome. * species: Now outputs fractions as both a percentage of the mapped reads and as a percentage of the whole sample. You should now supply the unmapped SAM files from mapping to allow the sample-level calculation to occur. * species: The format of the relabel file has changed. See the user manual for more information. * sdfsubset: Added --names to allow extracting sequences by name rather than sequence ID. Beta commands: * mendelian: This command detects variants that violate mendelian inheritance constraints. The records may be output directly to a VCF file. * vcfmerge: This command combines separate input VCF files (either as a result of calling the same sample on separate chromosomes, or to combine individual sample VCF into a multi-sample VCF) into a single output VCF. * svprep: Produces additional statistics in order to support the discordant read breakpoint tool. The argument specification has also changed slightly in that --input need not be supplied before the input mapping directory name. * discord: New breakpoint finding tool based on clusters of discordant reads. RTG Investigator 2.4.1 (2012-02-13) ----------------------------------- This is a bug fix release only. * cgmap: The --sex flag was not being correctly obeyed. * sdf2fastq: Fix for incorrect sequence output from SDFs containing variable length reads. * coverage: Fixed a case where 0 coverage could result in a NaN in the output file. RTG Investigator 2.4 (2011-11-23) --------------------------------- Major features of this release: * mapx now has support for variable length and reads longer than 189nt. Bear in mind that as mapx currently performs global alignment, longer reads will be less likely to have a high scoring match - you may need to adjust alignment thresholds appropriately.
* The snp module for calling SNPs, MNPs, and indels now supports haploid calling, and is faster (almost 2x faster for Complete Genomics data). * End to end handling of sex chromosomes in human variant calling. After creating a one-off chromosome specification file for your reference genome, mapping and variant calling commands allow you to specify the sex of each individual being processed. * Improved SNP calling accuracy for Ion Torrent, largely as a result of better handling large indels during initial mapping and realignment during variant calling. * New somatic variant caller (licensees only). As with the singleton variant caller, this module is also able to utilize the chromosome specification file to automatically produce appropriate haploid/diploid calling on sex chromosomes. * New pedigree-aware family variant caller (licensees only). This caller performs joint calling of all members of a family (mother, father, and any number of sons/daughters). This particularly improves the accuracy of variant calling when coverage of each individual is low. As with the singleton caller, this module is also able to utilize the chromosome specification file to automatically produce appropriate haploid/diploid calls on sex chromosomes. Changes by command: * family/somatic: These modules now implement complex calling resulting in improved accuracy. * family: Now produces a QUAL score. * mapx: Support for variable length read sets - previously read sets with more than a few nt deviation in length were not supported (if attempted, mapping performance would degrade with shorter reads). Variable length reads are now fully supported. * mapx: Initial support for reads longer than 189nt. * mapx: Handling of the --max-alignment-score for percentage based thresholds was incorrect in that it was calculated based on the pre-translated read length. This is now fixed and the flag description has been updated. 
* snp: Improvements to Ion Torrent snp calling (as determined by the read group platform field being set to IONTORRENT). * snp: Added new flag --ploidy to allow specifying whether to perform haploid or diploid variant calling. * snp: Switched to new internal architecture to more readily allow multithreading. It no longer has a limit on the number of input SAM files; however, the input SAM files must now be tabixed (or indexed BAM). * snp: Fixed the handling of calling near boundaries of user-specified --region locations (previously mappings overlapping the region border were not being supplied to the snp caller). * snp: CG snp calling speed is approximately 2x faster. * map/snp/somatic/family: Added support for sex specific mapping and variant calling by defining a reference configuration file and using the appropriate --sex flag during mapping and snp calling. See the user manual for more details. * sdfstats: New option --sex to list the reference sequences along with their ploidy for each sex. * map/snp/somatic/family: Improvements have been made to the calibration files produced during mapping to allow snp calling coverage filters to handle coverage variations per sequence (e.g. due to varying ploidy on sex chromosomes). You can generate new calibration files for existing mappings with the calibrate module. * snp: New VCF filter RCEQUIV denotes when a variant is equivalent to a previous variant (these typically occur at either end of homopolymer regions). * snp: New output file regions.bed.gz containing extra information regarding the calling. Currently it lists the regions that were called using complex calling. * snp: QUAL scores for extremely confident calls were being capped at 1000000, however this was also including all scores above about 3000. QUAL scores are now more accurately output in the VCF. * The .bz2 decompression library could not handle multi-member files. This has been extended to support these files.
* extract: Bug fix when extracting VCF/coverage from a file containing a single reference and no region was specified. * extract: Bug fix for when the specified region contained an invalid range. * all: Updated the bundled JVM to 1.6.0_29 * windows: Fixed a problem when RTG was installed to a location containing spaces in the path name. RTG Investigator 2.3.2 (2011-10-06) ----------------------------------- * format: Added the ability to perform base-quality read trimming, using BWA-style "best quality sum" length determination. Trimming low quality ends off reads can significantly improve the quality of Ion Torrent mappings. E.g. --trim-threshold 15. * map/mapf: Improved mapping defaults for Ion Torrent data. * map/mapf/mapx/format: Added the ability to accept input read data in SAM/BAM format, by supplying --format sam-se or --format sam-pe, for single or paired-end data respectively. The input SAM/BAM file must be sorted by query name. * mapf: reduced memory usage, particularly with large numbers of reference sequences. * mapx: add a warning when the selected parameters will result in a large number of indexes, and thus likely to give poor speed. * coverage: fix an exception when encountering third-party SAM records with IH attribute set to 0 and NH greater than 0. * sam2bam: this is a new module that specifically converts coordinate-sorted SAM to BAM. * sammerge: updated the default behaviour to not perform filtering of records marked as unmapped or PCR duplicates (the flag --include-unmapped has been replaced by --exclude-unmapped, and the flag --include-duplicates has been replaced by --exclude-duplicates) * sammerge: when the output file ends in ".bam", sammerge will produce BAM rather than SAM. NOTE: When performing snp calling with --region on a partial chromosome, you should currently enlarge your region by a read length on each end to ensure all supporting evidence is seen near the boundaries. This will be addressed in a subsequent release. 
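The 2.3.2 notes above describe base-quality trimming using BWA-style "best quality sum" length determination. The following is a minimal sketch of that rule as it is commonly described for BWA's quality trimming, not RTG's actual code. The function name `trimmed_length`, the `min_length` parameter, and the choice to trim only the 3' end are assumptions made for illustration:

```python
def trimmed_length(quals, threshold, min_length=0):
    """Return how many leading bases to keep after 3' quality trimming.

    Illustrative sketch only (names and 3'-only trimming are assumptions).
    Walks in from the 3' end accumulating (threshold - quality) and keeps
    the prefix ending where that running sum was largest.
    """
    running = 0
    best = 0
    keep = len(quals)  # default: keep the whole read
    for i in range(len(quals) - 1, min_length - 1, -1):
        running += threshold - quals[i]
        if running < 0:
            break  # good-quality bases now outweigh the poor tail
        if running > best:
            best = running
            keep = i  # trim everything from position i onward
    return keep


# Phred qualities for a 7-base read with a poor-quality tail.
qualities = [30, 30, 30, 30, 10, 12, 8]
print(trimmed_length(qualities, threshold=15))  # prints 4
```

With a threshold of 15 (as in the `--trim-threshold 15` example), the three low-quality bases at the end are removed and the first four bases are kept. A read whose final base already exceeds the threshold is left untrimmed.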
RTG Investigator 2.3.1 (2011-09-12) ----------------------------------- * map/cgmap: SAM flag 0x100 (alignment is secondary) is now set for all non-uniquely mapped/mated records. * map/cgmap: SAM flag 0x8 (mate is unmapped) in unmated and unmapped SAM files now indicates whether the mate is globally unmapped (however, mate position information is not available in these records). Previously this flag was always unset in order to avoid Picard warnings about not having position information supplied, however the SAM spec allows mate position to be unspecified and the information in the flag is useful nonetheless. These warnings will now be seen if you run the Picard validation tools. * map: fix exception when using --top-random option. * all: allow '=' in sequence names as long as it is not the first character. RTG Investigator 2.3 (2011-08-31) --------------------------------- * cgmap: switch to a new aligner implementation that produces better alignments and results in a 20-30% improvement in execution time. The SAM extended attributes GC/GS/GQ containing CG specific information have been replaced by more expressive attributes XU/XR/XQ. See the user manual for more details. * map/snp: Initial Ion Torrent support. Specifying the IONTORRENT platform in the read group information during mapping will alter default alignment penalties and thresholds to better handle the Ion Torrent indels and will propagate through variant calling. * snp: ambiguity ratio (AR) and allele balance (AB) have been added to FORMAT output in VCF. Calls that are made using the complex realigning caller are now indicated as such with an XRX annotation. * snp: summary statistics have been updated to contain more useful information in a more readable presentation. * snp: removed --output-second flag which was a hangover from a previous output format and did not affect the VCF produced. * many commands: now support reading .bz2 compressed FASTA/FASTQ files. 
* mapx: now supports direct loading of reads from FASTA/FASTQ. * coverage/species: now includes sequence lengths in output. * coverage: produces additional coverage information regarding non-N regions. * map/mapf: performance and memory improvements when mapping against very large numbers of reference sequences. * map/cgmap/mapf: the value supplied to the --sam-rg flag may now be either the name of a file containing the read group information, or a string containing the read group information itself (tabs must be represented by the sequence \t rather than literal tab characters, see the documentation for more information). * sdfsplit: now uses a disk-based SDF reader by default; the new --in-memory flag enables the older method (for faster processing if sufficient RAM is available). * format: added the --allow-duplicate-names flag to disable the duplicate sequence name detection (this can save large amounts of memory when formatting extremely large datasets). * sdfsplit: renamed the --disable-dupe-detection flag to --allow-duplicate-names for consistency with format. * rtg wrapper script: rtg and the java that gets invoked now share the same unix process group so that signal handling works as expected within cluster scenarios. RTG Investigator 2.2.1 (2011-07-14) ----------------------------------- * mapx: fixed an overflow problem when the number of reads times the --max-top-results setting exceeded Integer.MAX_VALUE (2^31-1). * rtg wrapper script: added safety checks for malformed cfg files (for example, it is easy to forget to include quotes when a property needs spaces). Also, the default rtg.cfg sets RTG_JAVA_OPTS to disable the JVM use of the popcount instruction until Oracle bug number 7063674 is fixed. * many commands: included a workaround for a bug in gzip decompression that is present in many recent versions of the JRE. This allows us to include a no-JRE distributable, so we can now officially support MacOSX as a platform.
* EULA: permit investigators to use for evaluation; registration overview; non-competitive use only. * snp/coverage: when supplying lists of SAM files via --input-list-file, the list files are now tolerant of extra white space surrounding the file names and empty lines. Lines starting with the hash character '#' are now treated as comments and are ignored. * map/cgmap: RTG mated SAM files contain records in pairs, but in very heavy repeat regions this would occasionally be violated and the resulting SAM file would contain a SAM record for one arm but not the other. This is now fixed. * map/cgmap/mapf: Fixed rare crash that could occur when running map/cgmap with --all-hits option, or mapf. RTG Investigator 2.2 (2011-06-08) --------------------------------- Initial public release. NOTE: Non-deterministic mapping results have been observed on modern CPUs with Java versions 1.6.0_18 and newer due to a bug in the use of the popcount instruction. If your CPU implements SSE4 instructions, we recommend adding -XX:-UsePopCountInstruction to the RTG_JAVA_OPTS configuration setting to work around this. We have filed a bug with Oracle regarding this (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7063674) but there is currently no resolution.
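The 2.2.1 notes above describe how list files supplied via --input-list-file are now read: surrounding white space and empty lines are tolerated, and lines starting with '#' are ignored as comments. A minimal sketch of those rules follows. The function name `read_file_list` is hypothetical, and treating a '#' that follows leading white space as a comment marker is an assumption rather than documented behaviour:

```python
def read_file_list(text):
    """Return the file names listed in `text`, one per line.

    Illustrative sketch only, not RTG's parser. Blank lines are skipped,
    surrounding white space is stripped, and lines whose first
    non-blank character is '#' are treated as comments (an assumption).
    """
    files = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        files.append(entry)
    return files


example = "  sample1.sam.gz  \n\n# second lane\nsample2.sam.gz\n"
print(read_file_list(example))  # prints ['sample1.sam.gz', 'sample2.sam.gz']
```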