# Release Notes for RTG Tools

Below are the release notes for the full RTG suite, upon which
RTG Tools is based.  Not all features described below may be included
in this product.

RTG Core 3.8.3 (2017-08-02)
---------------------------

This release primarily includes bugfixes and minor improvements:

* rocplot: (GUI) Improvements to graph zooming, to allow stepping back
  to previous zoom levels as well as fully un-zooming.

* rocplot: Improve the automatic curve naming heuristic to ignore
  directory name suffixes like "-eval", ".vcfeval" etc, and similar
  prefixes.

* rocplot: Enable text antialiasing in GUI and PNG output.

* vcfeval: More graceful handling of input VCFs containing REF values
  that are not valid according to VCF specifications.

* vcfmerge/vcfeval: Normalize the casing of nucleotides in REF/ALT,
  which permits merging records where the REF/ALT differ in casing.

* vcffilter: Graceful error handling of a new category of invalid
  JavaScript expression.

* vcfsubset: Don't complain when using --keep-filter/--remove-filter
  flags with "PASS" and the VCF header doesn't contain a declaration for
  that filter.

* misc: Prevent a unit test failure when running on newer versions of
  Ubuntu.
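
The REF/ALT case normalization noted above for vcfmerge/vcfeval can be pictured with a minimal Python sketch. This is not RTG's implementation: the function names `normalize_alleles` and `same_alleles` are hypothetical, and records are simplified to `(REF, [ALT, ...])` pairs.

```python
def normalize_alleles(ref, alts):
    """Return REF and ALT alleles with nucleotides upper-cased.

    Symbolic alleles get no special treatment in this sketch.
    """
    return ref.upper(), [alt.upper() for alt in alts]


def same_alleles(record_a, record_b):
    """Compare two (ref, alts) pairs, ignoring nucleotide casing."""
    return normalize_alleles(*record_a) == normalize_alleles(*record_b)


lower = ("acg", ["a"])
upper = ("ACG", ["A"])
print(same_alleles(lower, upper))  # True: the records differ only in casing
```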


Previous releases
=================

RTG Core 3.8.2 (2017-06-20)
---------------------------

This release primarily includes bugfixes and minor improvements:

* vcfeval: Records where the REF/ALT contain bases not permitted by the
  VCF specification are now skipped (and reported in the log) rather
  than terminating execution.

* vcfeval: (`combine` and `ga4gh` output modes only) These modes were
  inserting a redundant VCF header entry containing the command line,
  which has been removed.

* vcfeval: GA4GH output mode now supports loose positional matching of
  variants (within +/-30bp by default, and adjustable via
  --Xloose-match-distance).

* many: Prevent number formatting issues in non-English locales. The
  locale is now forced to US.

* many: Some commands were not appending gzip termination blocks to VCF
  outputs, which could result in subsequent warning messages being
  produced by some third party tools.

* many: Improve the consistency of exception handling in cases where the
  exception is thrown in a worker thread.

* many: Attempting to supply file lists via shell process redirection
  would fail in non-obvious ways. File lists from process redirection
  are not currently supported and are now checked for up-front.

* minor: When setting up rtg bash tab completion, issue a warning if an
  incompatible completion function has already been installed. (This can
  happen on some Linux distros if you have the system `bash-completion`
  package installed and attempt to tab-complete rtg before installing
  rtg bash completion.)

* minor: Fix a typo in the example configuration settings in rtg.cfg
  (specifically, RTG_JAVA_OPTS was incorrectly listed as
  RTG_JAVA_OPTIONS).
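
For the gzip termination block item above, the following Python sketch checks whether a block-compressed file ends with the 28-byte empty BGZF block that the SAM/BAM specification defines as an end-of-file marker. That this marker is the "termination block" meant by these notes is an assumption, and `has_bgzf_eof` is a hypothetical helper, not part of RTG.

```python
# Empty BGZF block used as the end-of-file marker (SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200" "1b000300" "00000000" "00000000"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)  # move to the end of the file to learn its size
        size = handle.tell()
        if size < len(BGZF_EOF):
            return False
        handle.seek(size - len(BGZF_EOF))
        return handle.read() == BGZF_EOF
```

Calling `has_bgzf_eof("calls.vcf.gz")` (a placeholder filename) on an output file indicates whether tools that look for the marker are likely to warn about it.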

RTG Core 3.8.1 (2017-05-29)
---------------------------

This release primarily includes bugfixes and minor improvements:

* rocplot: (GUI) The right hand panel now includes a visual indication
  of the color for each curve.

* rocplot: (GUI) The color for a curve can now be set via color picker
  available from the per-curve context menu.

* rocplot: (GUI) Reordering the curves is now achieved by drag and drop
  rather than the (now removed) reorder buttons.

* misc: The RTG Tools release includes a scripts/demo-tools.sh that
  gives a quick end-to-end demonstration of simulation and VCF
  manipulation commands. This is similar in nature to the
  scripts/demo-family.sh script that is included in RTG Core.

* vcfeval: Fix an exception caused by the skipping of heterozygous
  structural variants being dependent on the GT field allele
  ordering. These variants are now correctly skipped. In previous
  releases the cases that slipped through would enter matching with a
  stub allele representing the SV allele.

* vcfeval: When running a sample-free comparison via the option
  `--sample ALT`, ignore records/alleles corresponding to structural
  variants.  In 3.8 these could produce an exception, and in previous
  releases any SV alleles present were included as a generic token
  during matching.

* vcfeval: Improve the handling of non-user exceptions encountered
  during VCF loading.  Previously these would produce an often
  inscrutable message.

* version: Update copyright year and include an alternative citation
  more appropriate for those using RTG Tools.

* popsim: Now includes the random number seed in the VCF header for
  consistency with other simulation commands.

RTG Core 3.8 (2017-05-15)
-------------------------

Major features of this release:

* Improvements aimed at preprocessing and QC. In particular, RTG
  includes two new commands, fastqtrim and petrim, for preprocessing
  FASTQ files to apply various kinds of trimming before entering the NGS
  pipeline. These commands greatly expand what was previously available
  during data formatting.

* The suite of simulation commands that were previously only available
  as part of RTG Core have been included in the RTG Tools package. These
  commands encompass simulation of reference genomes (genomesim),
  simulation of population-level variants (popsim), individual sample
  genomes using population variants (samplesim), simulation of samples
  as members of a pedigree obeying inheritance rules (childsim),
  simulation of de-novo variants (denovosim), generation of a genome
  given a VCF of sample variants (samplereplay), and read simulation
  according to a range of sequencer parameters (readsim/cgsim).

* Initial support for accepting CRAM files as input to variant calling
  commands and most other commands that accept alignments as input. For
  some commands this may now require specifying a reference SDF in order
  to decode the CRAM files.

* Improvements to the prebuilt AVR models that perform variant
  scoring. These models have been rebuilt using training data
  incorporating the latest truth sets produced by the GIAB initiative as
  well as improvements to the underlying machine learning algorithms.

* User manual improvements, in particular the baseline progressions
  section has been rearranged to better illustrate how to run end-to-end
  RTG calling pipelines that make best use of RTG features such as
  sex-aware and pedigree-aware variant calling.

Detailed changes are listed below by area.  For more information on
new features, see the RTG Operations Manual.

### Basic Formatting and Mapping

* fastqtrim: This new command allows trimming of FASTQ files with much
  more flexibility and control than is available directly from
  format. See the user manual for more information and examples.

* petrim: This new command allows trimming of read bases in paired-end
  data where read-through has occurred, as determined by alignment
  overlap. See the user manual for more information and examples.

* format: Support for reading interleaved paired-end FASTQ added. This
  is useful for formatting directly from streamed output of the petrim
  command, avoiding additional disk I/O.

* format/map: The quality encoding for FASTQ input files now defaults to
  the sanger encoding used by the majority of modern FASTQ files, and so
  the --quality-format flag typically only needs to be specified when
  processing older FASTQ files employing an alternative encoding.

* many: When outputting FASTA/FASTQ, ensure consistent use of unix line
  endings across the various commands.

* calibrate: When calibrating multiple BAM files, each is calibrated in
  an independent thread, obeying --threads flag.

* sammerge: New flag --subsample that permits a fraction of the
  alignments through to the output.  In addition, the new flag --seed
  lets you control which seed is used for this filtering.

* coverage: Computes additional QC metrics fold-80 penalty and median
  coverage.

* coverage: New flag --per-region which changes how BED/BEDGRAPH
  coverage records are triggered, from being whenever the coverage level
  changes, to only when the region changes.

* sammerge: Will now create output files in CRAM format if the output
  filename ends with ".cram". This requires the user to specify the
  reference SDF via the new --template flag.

* index: Now allows creating indexes for CRAM files. These are the
  `.bai` indexes currently supported by htsjdk, rather than `.crai`
  indexes.
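
The combination of a subsampling fraction and a seed, as described for sammerge above, can be sketched in Python as follows. This is only an illustration of seeded random filtering, not sammerge's actual algorithm; the `subsample` function is hypothetical and treats every record independently (how sammerge handles read pairs is not described in these notes).

```python
import random


def subsample(records, fraction, seed=None):
    """Yield each record with probability `fraction`.

    A fixed `seed` makes the selection repeatable across runs.
    """
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


reads = [f"read{i}" for i in range(1000)]
first = list(subsample(reads, 0.1, seed=42))
second = list(subsample(reads, 0.1, seed=42))
print(first == second)  # True: the same seed selects the same reads
```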

### Variant Calling

* snp: Includes INFO.DP annotations in output VCF, for consistency with
  the existing multi-sample caller output.

* family/population/somatic: New VCF annotations (OCOC/OCOF/DCOC/DCOF)
  that indicate the count/fraction of contrary evidence observed in the
  original (parent) vs derived (child) samples.

* snp/family/population/somatic: These commands now support SAM/BAM
  files that make use of the '=' character in the SEQ field (such as can
  be created by BamUtil:convert).

* snp/family/population/somatic: These commands now support CRAM files
  as input.

* family/population: Improved error reporting for semantically incorrect
  user-supplied pedigree information.

* snp/family/population/somatic: Improvements to the accuracy of the
  pre-built AVR models. These models have been rebuilt using training
  data incorporating the latest truth sets produced by the GIAB
  initiative as well as improvements to the underlying machine learning
  algorithm.

* snp/family/population: The default AVR model is now illumina-wgs.avr
  (previously the default was illumina-exome.avr). For exome calling,
  the illumina-exome.avr model provides an advantage over
  illumina-wgs.avr only when the primary interest is maximising the
  scoring of variants called outside of exome target regions.

* many: For compatibility with non-human species, sex handling of PAR
  regions has been extended to allow the length of a PAR region in each
  member of an allosome pair to be of different length.

* svprep: Add the ability to run on merged alignment files rather than
  requiring alignment files to be separated into mated vs unmated vs
  unmapped.

* svprep: New flag --no-augment permits the computation of read
  group statistics files only, for use when collecting statistics from
  third party alignment files.

* avrpredict: New flag --sample to allow AVR scoring of only the
  specified sample names.

* avrpredict: New flag --vcf-score-field to allow storing the AVR score
  into a format field with a different name, useful when comparing
  multiple scoring models.

* avrbuild: Improvements to the quality of models built in the presence
  of missing annotations.

### Variant Processing and Analysis

* vcfmerge: When combining records at the same position, vcfmerge will
  now not combine records at a site where some records use a VCF padding
  base (as required by the VCF specification to prevent REF or ALT being
  zero-length) and some records do not. This is because a record which
  utilizes a padding base is not making an assertion about the genotype
  of the padding base itself, and merging these records loses this
  semantic distinction. (The old behaviour can be obtained via
  --Xnon-padding-aware.)

* vcfannotate: New flag --no-header to suppress output of the VCF header.

* vcfsubset: New flag --remove-ids to allow clearing the ID column.

* rocplot: New flag --zoom which allows the specification of an initial
  zoom to display. See the user manual for a description of the
  coordinate syntax.

* rocplot: (GUI) Add ability to remove a curve via per-curve pop-up menu
  in the side-pane.

* rocplot: (GUI) Prevent loading the same ROC data file multiple times,
  and improve error handling on invalid files.

* rocplot: (GUI) Improvements to the open file dialog. Now defaults to
  displaying ROC data files only, permits opening multiple ROC data
  files at once via multi-select, and other minor changes.

* rocplot: (GUI) The "Cmd" button now shows the command in a pop-up
  dialog rather than sending it to the terminal, which eliminates the
  need to search through multiple tmux windows to find where rocplot was
  started from.

* many: Invalid VCF header contig length specifications are now reported
  gracefully.

* many: Improved error reporting of general VCF header parsing errors,
  now including the problematic line where possible.

* many: Improved error reporting of malformed GT fields.

### Metagenomics

* species: Fix the handling of mappings that contain non-unique
  read-names (as could arise when mapping directly from FASTQ files as
  separate mapping runs and passing the resulting alignments to
  species).

* species: Accuracy improvements when using paired-end data as the
  underlying data source.

### Other

* pedstats: Improved the GraphViz pedigree visualization layout for
  normal pedigree structures. The old layout is available with the new
  ``--simple-dot`` flag.

* many: The following simulation commands are now included as part of
  RTG Tools: genomesim, cgsim, readsim, popsim, samplesim, childsim,
  denovosim, samplereplay.

* readsim: When using --taxonomy-distribution and --distribution, one of
  --abundance or --dna-fraction must be supplied in order to indicate
  the desired interpretation.

* index: The -f flag is now optional and by default index will attempt to
  determine the file format by the extension.

* many: Most commands accept the advanced flag --Xforce that allows them
  to continue in the case of pre-existing output files or
  directories. Be aware that particularly in the case of output
  directories the final directory contents may include files from
  previous runs (or even other commands), so this option should not be
  used in production scenarios.

* many: Fixed an exception that could occur when performing multiple
  region based querying of SAM/BED/VCF records, where the regions were
  densely packed near the ends of chromosomes.

* many: Almost all commands that take SAM/BAM as input now support CRAM
  files as input. Some of these commands have a new flag used to supply
  the reference SDF which is required when decoding CRAM.

* misc: The rtg bash command completion has been improved to be more
  portable and no longer caches completion data on disk.

* many: Linux and Windows packages have updated the bundled JRE to the
  latest from Oracle.


RTG Core 3.7.1 (2016-10-18)
---------------------------

This release primarily includes bugfixes and minor improvements:

* map/cgmap: Addresses a pathological case where a particular paired-end
  read pair plus reference sequence could run for a disproportionately
  long time in highly repetitive regions.

* vcfeval: Fixes a rare exception that could occur when a "too-hard"
  region occurs right at the end of a reference sequence.

* rocplot: Fixes an exception that would occur when trying to plot the
  result of evaluating a call set (containing variants) against a
  baseline containing no variants.

* rocplot: (GUI) When loading several files on startup, sometimes the
  initial view would not be fully zoomed out.  We now ensure that the
  plot is zoomed out after the initial files are loaded.

* vcffilter: Fixes a regression in command line flag validation that
  would cause a talkback exception if no input file was supplied rather
  than presenting an appropriate message.

* vcfmerge: Fixes an exception that could occur when merging a mixture
  of regular VCFs containing sample columns with sites-only VCFs.

* bgzip: Fixes an exception that could occur when decompressing from
  stdin.

* Minor documentation fixes.


RTG Core 3.7 (2016-08-25)
-------------------------

Major features of this release:

* Improvements to mapping speed when aligning targeted sequencing
  data. This feature makes use of a per-reference hash blacklist which
  is constructed once per reference genome and can yield significant
  speed improvement.  In addition, several changes were made to reduce
  peak memory use during mapping.

* Variant callers now allow the optional inclusion of expected
  germline allele balance terms in the Bayesian model.  On a
  genome-wide scale, this generally results in a reduction in
  false-positive calls, although sensitivity may be reduced for
  variants which do not follow allele balance expectations, such as
  mosaic de novo variants.

* Several improvements to the somatic caller. These include the ability
  to enable output of germline variants (due to the joint calling,
  accuracy of calling germline variants during somatic calling is
  typically higher than separately calling germline variants from the
  normal sample alone). The somatic caller now has the ability to
  explicitly model the expected somatic allelic fraction, for use in
  cases where the tumor heterogeneity is expected to be low. Additional
  options allow the output of records at sites exceeding user-specified
  thresholds for non-reference evidence. We have also included an AVR
  model specifically built for somatic calling which provides more
  accurate scoring than the regular germline AVR models.

* Several improvements to the variant comparison tools.  vcfeval now
  includes the ability to evaluate matches across confident-region
  boundaries according to GA4GH recommended practice.  vcfeval can be
  used to compare against "sample-free" VCFs such as
  ExAC/COSMIC/dbSNP, and the runtime has also been significantly
  improved.  In addition, the rocplot command can now produce
  precision-sensitivity graphs, and can output SVG as a more
  publication-ready format.

Note: RTG now requires Java 8, so for those using the "nojre" RTG
download or who are building from source, make sure you have Java 8
installed.

Detailed changes are listed below by area.  For more information on
new features, see the RTG Operations Manual.

### Basic Formatting and Mapping

* format: Automatically installs reference genome configuration
  information when a recognized reference genome is being formatted to
  SDF. Also outputs a reminder for those cases where it looks like a
  reference genome is being formatted but which is not one of the
  recognized genomes.

* sdf2cg: New command to allow the export of Complete Genomics data
  that has been formatted as SDF to Complete Genomics TSV read format.

* map/cgmap: TLEN was not being correctly computed in the presence of
  soft clipping and back steps. This has now been corrected.

* map/cgmap: Several reductions in peak memory use during mapping.

* map: Significant speed improvement when mapping highly targeted
  sequencing data, using the mechanism of a repetitive hash blacklist.
  This is enabled via the new flag --reference-blacklist. A separate
  tool 'hashdist' is used for this one-off blacklist construction.

* hashdist: New command that can be used to analyse the uniqueness of
  k-mers contained within a reference sequence and to produce a
  reference hash blacklist.

* calibrate: New flags --exclude-bed and --exclude-vcf can be used to
  exclude sites of known genomic variation during the computation of
  calibration data. It is not currently possible to specify this
  information to the automatic calibration that is carried out during
  mapping; this will be added in a future release.
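
To give a feel for the k-mer uniqueness analysis and blacklist construction described for hashdist above, here is a minimal Python sketch. It is not hashdist's implementation: the functions `kmer_counts` and `blacklist` and the `max_count` threshold are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every k-mer (substring of length k) in `sequence`."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def blacklist(sequence, k, max_count):
    """Return the set of k-mers occurring more than `max_count` times."""
    counts = kmer_counts(sequence, k)
    return {kmer for kmer, n in counts.items() if n > max_count}


reference = "ACACACACGTTGCA"
print(sorted(blacklist(reference, k=4, max_count=1)))  # ['ACAC', 'CACA']
```

A mapper could skip index hits for blacklisted k-mers, since highly repeated k-mers produce many candidate locations and are the main cost in repetitive regions.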

### Variant Calling

* snp/family/population/somatic: These callers expect calling to be
  carried out on alignments that have had calibration information
  computed. They now require the explicit use of the --no-calibration
  flag in order to proceed anyway.

* snp/family/population/somatic: These commands now output a warning
  if too many "excessive coverage" situations are encountered, as this
  usually signifies that the user has incorrectly calibrated their
  mappings or has failed to supply an appropriate coverage parameter
  to the caller. In addition, these commands output a warning if it
  appears that calibration has not been computed from correct regions
  for targeted data.

* snp/family/population/somatic: New flag --min-base-quality which
  allows explicit ignoring of base calls which do not meet the
  specified minimum phred quality score. These bases will be treated
  the same as an N and will not contribute to allele counts.  The
  default is to consider all bases.

* family/population/somatic: The semantics of --max-coverage has
  changed from being the total coverage across all samples, to being
  the average per-sample coverage.  This flag is typically only used
  when running without calibration, and this change makes the default
  behaviour more scalable with varying numbers of samples.

* snp: An explicitly specified --ploidy flag now overrides the ploidy
  obtained from reference genome configuration (if present).
  Previously the ploidy specified in the reference genome would take
  precedence.

* snp/family/population/somatic: Fixed an incorrect (and sometimes
  non-deterministic) computation of the PUR FORMAT annotation. This
  does not affect primary calling but could result in changes in AVR
  score.

* snp/family/population/somatic: Updated the Bayesian model to include
  a term for the expected allele balance. This is disabled by default,
  and can be enabled with the new flag --enable-allelic-fraction. This
  option gives improved precision for regular germline calling, but
  sensitivity to mosaic variants or those within CNV regions may be
  reduced.

* snp/somatic: The new flags --min-variant-allelic-depth and
  --min-variant-allelic-fraction can be used to enable output at sites
  where these thresholds are met, even if the caller would not
  otherwise make a call. Note that this does not act as a filter to
  prevent the caller from output at sites where these thresholds are
  not met.

* somatic: New flag --include-germline which instructs the somatic
  caller to also output variants which have been identified as
  germline variants.

* somatic: New flag --enable-somatic-allelic-fraction which instructs
  the Bayesian model to include a term for the expected somatic
  allelic fraction in the calling.  This flag is most appropriate when
  tumor heterogeneity is low.

* somatic: A new pre-built AVR model is provided for somatic calling
  which provides better scoring for somatic variants than the regular
  AVR models. This new model, "illumina-somatic.avr" is selected by
  default by the somatic caller.
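
The effect of a minimum base quality threshold, as described for --min-base-quality above, can be sketched in Python. This is a simplified illustration rather than the callers' actual evidence model; the `allele_counts` function and the single-site pileup inputs are hypothetical.

```python
from collections import Counter


def allele_counts(bases, phred_qualities, min_base_quality=0):
    """Count observed bases at one site, skipping low-quality base calls.

    Bases below `min_base_quality` are treated like N and contribute nothing.
    The default of 0 considers all bases.
    """
    counts = Counter()
    for base, quality in zip(bases, phred_qualities):
        if base.upper() == "N" or quality < min_base_quality:
            continue
        counts[base.upper()] += 1
    return counts


pileup_bases = "AAGAN"
pileup_quals = [35, 12, 30, 8, 40]
print(allele_counts(pileup_bases, pileup_quals))  # A: 3, G: 1
print(allele_counts(pileup_bases, pileup_quals, min_base_quality=20))  # A: 1, G: 1
```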

### Variant Processing and Analysis

* vcfsubset/vcffilter: New flag --no-header which omits the output of
  the VCF header.

* vcffilter: New option --keep-expr to allow filtering records based
  on simple JavaScript expressions with natural VCF field access. For
  example 'NA12878.DP > NA12892.DP' to select records from a trio
  call-set where the depth of NA12878 is greater than that of her
  mother. See the user manual for more information and examples.

* vcffilter: New option --javascript to allow advanced filtering and
  other processing of the VCF file using powerful JavaScript
  filters. These scripts can contain initial setup, per-record
  actions, and end functions. See the user manual for more information
  and examples.

* vcfeval: Specifying a sample name of ALT for either the baseline or
  call sample name instructs vcfeval to match against all possible
  non-ref diploid (or haploid if using --squash-ploidy) genotypes
  possible from the declared ALTs. This permits matching against a VCF
  that contains no sample column, for example to find hits against a
  sample-free VCF such as ExAC or COSMIC.

* vcfeval: New flag --evaluation-regions, which adds support for
  matching across high-confidence/false-positive regions such as those
  supplied with GIAB or Illumina Platinum Genomes truth sets according
  to GA4GH recommendations. In summary, only matches against baseline
  variants within these regions count as true positives and only
  non-matched call variants made within these regions count as false
  positives.

* vcfeval: Now outputs additional true positive statistics for the
  unweighted calls, so you can see the simple count of true positives
  in call representation.  When computing precision, this uses the
  unweighted call count in the denominator, to reduce representation
  bias in the precision.

* vcfeval: Significant speed increase (often 2x speed up for typical
  WGS comparisons).

* vcfeval: New output mode 'roc-only' which skips the output of VCF
  files and only produces the ROC data files and summary metrics. This
  reduces run-time and the size of the output directories when doing
  many runs.

* vcfeval: Command line score field specification permits INFO.<name>
  form, for consistency with JavaScript expression notation, although
  the old form of INFO=<name> is still supported.

* rocplot: Added the ability to plot precision-sensitivity graphs via
  the new flag --precision-sensitivity.  In the interactive GUI the
  graph type can also be changed on the fly via a dropdown chooser.

* rocplot: Added the ability to output images in SVG format, both in
  non-interactive mode via the new flag --svg, and when saving images
  from the interactive GUI.

* rocplot: Improved the default labelling of curves by including the
  score field if available.

* rocplot: The curve palette size has been increased in order to allow
  easier differentiation when more than 8 curves are being displayed
  at once.

* rocplot: (GUI) Fixed an annoying bug that could occur when trying to
  edit the title of the plot or of the curves. Several other minor GUI
  improvements have been made, such as the ability to use the
  mouse-wheel to scroll large lists of curves.
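
The counting rule summarized for --evaluation-regions above can be sketched in Python. The sketch assumes variant matching has already been done and only shows how the regions restrict what is counted; it is not vcfeval's implementation. The functions, the `(position, matched)` tuples, and the single-chromosome, half-open region representation are all assumptions.

```python
def in_regions(position, regions):
    """Return True if `position` falls in any half-open (start, end) region."""
    return any(start <= position < end for start, end in regions)


def count_tp_fp(baseline, calls, regions):
    """Count true/false positives restricted to the evaluation regions.

    `baseline` and `calls` are lists of (position, matched) tuples, where
    `matched` is the outcome of a prior variant-matching step.
    """
    true_positives = sum(
        1 for pos, matched in baseline if matched and in_regions(pos, regions)
    )
    false_positives = sum(
        1 for pos, matched in calls if not matched and in_regions(pos, regions)
    )
    return true_positives, false_positives


regions = [(100, 200)]
baseline = [(150, True), (250, True), (180, False)]
calls = [(150, True), (160, False), (300, False)]
print(count_tp_fp(baseline, calls, regions))  # (1, 1)
```

Here the matched baseline variant at 250 and the unmatched call at 300 both lie outside the regions, so neither is counted.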

### Other

* aview: Now defaults to showing base colors in the terminal. Use
  --no-base-colors to disable this.

* aview: Better error handling for invalid SAM records.

* aview: New flag --print-soft-clipped-bases to display soft-clipped
  bases.

* chrstats: New flag --output-pedigree that can be used to create a
  default pedigree file based on the mappings of multiple samples,
  using inferred sample sex where possible.

* many: In several cases where a flag could be specified multiple
  times, it is now possible to supply a comma separated list of
  values. These are indicated in the output of --help.

* many: Most utility commands which write VCF files now do so
  asynchronously, often resulting in significant speed improvements.

* all: The distribution now includes an HTML version of the operations
  manual in addition to the PDF version.

* all: The minimum Java requirement for RTG is now Java 8.


RTG Core 3.6.2 (2016-03-10)
---------------------------

This release primarily includes bugfixes and minor improvements:

* map: Mapping very large numbers of reads in a single chunk or with
  low step size settings could exceed the capacity of some internal
  data structures, giving unpredictable results. An explicit check for
  these conditions has been added.

* map: Reduction in peak memory use when mapping paired-end data.

* vcfeval: Better error handling for variants which have triploid or
  higher GT (ploidy higher than 2 is not supported).

* extract: Extracting multiple regions from SAM/BAM across different
  chromosomes could cause an exception.

* rocplot: Improved error handling for yet more ways in which
  attempting to open a GUI from a headless server can fail.

* rocplot: (GUI) Minor improvement to crosshair handling.


RTG Core 3.6.1 (2016-01-25)
---------------------------

This release primarily includes bugfixes and minor improvements:

* coverage: Fixed an exception that could occur when supplying a
  reference SDF that did not contain all the sequences present in the
  alignments.

* family: Fixed an exception that could occur when supplying a family
  pedigree involving members not present in the input mappings.

* population: The COF/COC annotations for de novo calls that were
  recently added to the family caller are now also produced by the
  population command when appropriate.

* map/cgmap: When mapping pre-formatted reads containing SAM read
  group information embedded in the SDF and the input format was
  explicitly specified as SDF via -F sdf, the read group info wasn't
  being picked up. This is now fixed.

* vcfmerge: Speed improvement when merging VCF files containing a
  large number of contig header declarations.

* many: Speed improvement when accessing indexed datafiles
  (e.g. BED/BAM/VCF) that were being filtered by very large sets of
  regions.

* rocplot: Better error handling when trying to run the GUI on a
  machine where a graphics environment is unavailable.

* rocplot: (GUI) Update frame title when graph title changes.

RTG Core 3.6 (2015-12-07)
-------------------------

Major features of this release:

* Further improvements to somatic variant calling which significantly
  reduce the number of false positive calls while retaining somatic
  calling sensitivity.  These improvements are achieved by
  incorporating the presence of somatic-allele-supporting evidence in
  the normal into the Bayesian computation.  Additional VCF
  annotations quantifying these "contrary observations" are included
  in the output.

* De novo variant detection in families and pedigrees now incorporates
  similar techniques for a reduction in false positives.

* Our support for aligning and variant calling with reads produced by
  Complete Genomics Inc has been extended to their newer 29 base-pair
  read structure (these reads consisting of 10-9-10 sub-reads are
  often represented as 30 base-pairs with a redundant N).

* Several improvements to variant comparison with vcfeval, including
  the improved handling of call sets containing overlapping variants,
  and the ability to select alternative output modes depending on the
  desired analysis workflow.

Detailed changes are listed below by area.  Please read these through
fully, as some command-line flags have changed, so updates to your
pipeline scripts may be required.  For more information on new
features, see the RTG Operations Manual.

### Basic Formatting and Mapping

* cg2sdf: Add support for formatting CGI TSV reads files containing
  their version 2 reads.  These reads are typically represented as 30
  base-pair arms (10-10-10 subread structure containing a redundant N
  which is removed during formatting), although 29 base-pair arm
  representation (10-9-10 subread structure) is also supported.

* sdf2cg: This new command allows exporting SDF formatted Complete
  Genomics read data to their TSV reads file format.

* cgmap: Now supports aligning the version 2 read structure.  When
  aligning CGI reads, an indexing mask appropriate for the type of
  reads being mapped must be selected, so --mask is now a required
  flag.

* cgmap: Mask names have been changed to more clearly indicate which
  version of CGI reads they are applicable to.  Available masks are
  now "cg1" (formerly named "cgmaska15b1"), "cg1-fast" (formerly named
  "cgmaska1b1"), and "cg2" (a new mask for use with version 2 reads
  which is roughly equivalent in sensitivity to "cg1-fast").  Additional
  masks may be available in future.

### Variant Calling

* somatic: Features an improvement to the Bayesian calculation to
  better account for the presence of contrary evidence.  This has
  resulted in a large reduction in false positives while maintaining
  sensitivity.

* population/family: These pedigree-based callers now contain similar
  adjustments to the Bayesian calculation to better account for
  contrary evidence of de novo variants.  This has resulted in a large
  reduction in false positive de novos while maintaining sensitivity.

* somatic/family/population: These callers produce additional
  annotations in their output VCF that indicate the degree of contrary
  observations for the novel allele.  The COC annotation contains a
  simple count of the number of contrary observations and the COF
  annotation contains the contrary observations as a fraction of total
  observations.  Users who wish to adjust the sensitivity/precision
  tradeoff of their de novo call sets may wish to use these attributes
  for filtering.

* family/population: Fixed the marking of equivalent complex calls,
  which was not functioning for sex-aware calling on the Y chromosome
  when both males and females are present, resulting in occasional
  additional equivalent but differently represented variants present
  in the output.

* population: Better error handling when the user supplies a
  pedigree that contains cycles.

* avrbuild: The new COC and COF annotations are now available as
  derived annotations that can be used in model building.  One
  interesting use of these attributes may be to build AVR models
  specifically for predicting the correctness of de novo predictions.

* snp/family/population/somatic: These variant callers all now include
  support for CGI 29 base-pair read structure.

* snp/family/population/somatic: The pre-built AVR models distributed
  with RTG have all been rebuilt using current annotations and updated
  training data.
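
To illustrate how the COC and COF annotations described above relate to
each other, and how they might be used for filtering, here is a minimal
sketch.  It assumes the contrary and total observation counts are already
available as integers.  The function names and the 0.05 threshold are
hypothetical and are not RTG defaults.

```python
def contrary_observation_fraction(contrary_count: int, total_count: int) -> float:
    """COF as described above: contrary observations as a fraction of total observations."""
    if total_count <= 0:
        return 0.0  # assumption: no observations is treated as no contrary evidence
    return contrary_count / total_count


def passes_contrary_filter(contrary_count: int, total_count: int,
                           max_cof: float = 0.05) -> bool:
    """Keep a call only if its contrary fraction is at or below max_cof.

    max_cof is an illustrative threshold, not a recommended value.
    """
    return contrary_observation_fraction(contrary_count, total_count) <= max_cof
```

Lowering `max_cof` trades sensitivity for precision, which is the
adjustment the notes above suggest these attributes are useful for.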

### Variant Processing and Analysis

* vcfannotate: New option --relabel allows sample names in a VCF to be
  changed.

* vcfsubset: New flag --remove-qual to reset the QUAL field to '.'

* vcfsubset: Fixed a bug where encountering a VCF record that did not
  contain any FORMAT field specified in --keep-format would cause all
  subsequent records to be dropped.

* vcffilter: For convenience the existing flags --keep-format,
  --remove-format, --keep-samples, etc. now support comma separated
  lists.  For example: --keep-format GQ,AVR.

* vcffilter: New flag --remove-hom to exclude records where a sample
  was called as homozygous.

* vcfeval: New additional output modes that allow the selection of
  output files that best suit the desired workflow.  These are
  controlled via --output-mode flag and there are currently three
  options available: split (the default, equivalent to previous
  behaviour), annotate (outputs baseline and calls files augmented
  with match status annotations), and combine (provides a simple
  side-by-side two-column VCF). For more information, see the user
  manual.

* vcfeval: Removed option --baseline-tp, as the output of the baseline
  version of true positive variants is now always performed.  When using
  the default (split) output mode, these are output to tp-baseline.vcf
  as before.

* vcfeval: Added the ability to detect those FP and FN which have
  common alleles (e.g.: zygosity errors).  Previously this could be
  done manually by running vcfeval a second time using --squash-ploidy
  on the fp.vcf and fn.vcf of an initial comparison, but now it is
  automatically performed when running the new annotate or combine
  output modes.

* vcfeval: New flag --ref-overlap to allow matching variants where the
  alleles would overlap as long as the overlap bases are the same as
  ref.  Unambiguous VCFs should not need this option, but such cases
  can arise when using unsophisticated callers or VCF merging tools.

* vcfeval: Weighted ROC files now include a final data row that
  includes the statistics corresponding to no threshold application
  (and this includes any variants that were processed during path
  finding but which do not contain any ROC score field).  In an ROC
  plot, this final point may be visible as a "tick" at the end of the
  curve.

* vcfeval: The set of ROC data files that are produced are now for the
  following three subsets of calls: all calls, snps only, and non-snps
  only (e.g. indels, MNPs).  Some users were doing separate runs of
  vcfeval on input sets filtered by category in order to get separate
  statistics for snps vs indels, an approach which is prone to
  misclassification of complex variants.

* vcfeval: When processing multi-sample VCF files, it is now possible
  to specify different sample names for baseline vs callset, via the
  form: --sample baseline_sample,calls_sample.

* vcfeval: Fixed a rare bug where if the input VCFs contained multiple
  variants with the same reference position and length, the output
  VCFs could contain the incorrect variant.

* vcfeval: Fixed a crash that could occur when the input set contained
  a variant that extended off the end of the reference sequence.

* rocplot: (GUI) Fix several minor issues: initial paint was not laid
  out correctly; very small ROC files would not display status info;
  some UI layout improvements; and add a small amount of display padding.

* rocplot: (GUI) Malformed ROC data files now show an error dialog.
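
The idea of false positives and false negatives that share an allele
(such as zygosity errors), mentioned for vcfeval above, can be
illustrated with a simplified site-level check.  This is a sketch only:
vcfeval compares variants at the haplotype level, whereas the
hypothetical `share_alt_allele` helper below just compares the
non-reference allele indices of two GT strings, assuming both records
are at the same site and list the same ALT alleles in the same order.

```python
import re


def alt_alleles(gt: str) -> set:
    """Return the set of non-reference allele indices in a VCF GT string such as '0/1' or '1|2'."""
    alleles = set()
    for token in re.split(r"[/|]", gt):
        if token not in (".", "0", ""):
            alleles.add(int(token))
    return alleles


def share_alt_allele(baseline_gt: str, call_gt: str) -> bool:
    """True when the two genotypes have at least one ALT allele index in common."""
    return bool(alt_alleles(baseline_gt) & alt_alleles(call_gt))
```

For example, a baseline genotype of 0/1 and a called genotype of 1/1
differ only in zygosity, so they share ALT allele 1.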

### Metagenomics

* similarity: This tool will now make use of available taxonomy
  information in the case of a single supplied SDF, in order to allow
  the easy computation of a neighbour joining tree from a reference
  species database (or subset thereof).

### Other

* sdf2fasta/sdf2fastq: New flag --interleave to permit output of
  paired end data to a single output in interleaved fashion
  (i.e. alternating left and right arms).  This allows piping paired
  end data for simple command-line processing (although there is also
  sdf2sam which may be more applicable depending on the processing
  desired).

* cgsim: Added support for simulating reads with the CGI version 2
  read structure, controlled via a new flag, --cg-read-version.

* readsim: Add support for both versions of CGI read structures.  Use
  --machine complete_genomics (the original 35 base pair read
  structure) or --machine complete_genomics_2 (the newer 29 base pair
  structure).

* aview: New flag --unflatten to display unflattened CGI reads when
  present.  At present only version 1 reads can be displayed in
  unflattened form.

* misc: bash completion for RTG commands and options now works on Mac
  OS X (see scripts/rtg-bash-completion for instructions).

* misc: The underlying htsjdk library used for SAM/BAM support has
  been updated to version 1.141.

* many: The JRE bundled with Linux/Windows builds is now 1.8.
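
The alternating output order described for the sdf2fasta/sdf2fastq
--interleave flag above can be sketched as follows.  This is an
illustration of the ordering only, not RTG code: the `Record` type and
both function names are hypothetical, and records are assumed to arrive
as two equally long sequences of left and right arms.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (read name, sequence); a hypothetical minimal record


def interleave(left: Iterable[Record], right: Iterable[Record]) -> Iterator[Record]:
    """Yield left and right arms alternately: left 1, right 1, left 2, right 2, ...

    zip() stops at the shorter input, so unpaired trailing reads are dropped.
    """
    for left_arm, right_arm in zip(left, right):
        yield left_arm
        yield right_arm


def to_fasta(records: Iterable[Record]) -> str:
    """Render records as FASTA text, one name line and one sequence line per record."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

Because both arms of a pair are adjacent in a single stream, the result
can be piped into a downstream tool without keeping two files in step.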

RTG Core 3.5.2 (2015-10-15)
---------------------------

This release primarily includes bugfixes and minor improvements:

* many: When piping results from one command to another, and a later
  command closes the pipe (e.g. head), this scenario no longer
  produces a "Broken pipe" error message. This is consistent with the
  behaviour of commonly used command-line tools.

* rocplot: Updated to handle ROC data files that contain lines with
  a non-numeric score field. (In particular, future versions of vcfeval
  will include additional data-points corresponding to variants with
  no score provided.)

* rocplot: (GUI) Improvement to usability for curve renaming. Now a
  single-click in the curve title area enters edit mode, with
  RETURN/TAB to accept, ESC to cancel.

* rocplot: (GUI) Add a button that prints an equivalent command line
  to the terminal, for easy restarting with similar state,
  particularly if curve files have been added interactively.

* cgmap: Fix for sample sex being ignored when supplied via a pedigree
  file rather than using explicit sex flag.

* misc: Removed vestigial (and in RTG Tools' case, incorrect)
  "Licensed to:" line from the version command output.

* misc: Add BSD license text to the RTG Tools distributable zip.

RTG Core 3.5.1 (2015-09-07)
---------------------------

This release primarily includes bugfixes and minor improvements:

* coverage: Fix an exception that could occur if running with a
  reference SDF supplied that had chromosomes in a different order
  compared to the BAM sequence dictionary (typically this could occur
  when running coverage on third-party BAMs).

* extract: When extracting multiple regions these regions are now
  sorted.

* vcfeval: When an entire chromosome contained only baseline or only
  called variants, the summary statistics for FP/FN were not being
  incremented correctly.

* vcfeval: Fixed a case where path-finding could get confused and drop
  variants.

* vcfeval: Speed improvement in post-processing.

* many: Improved error reporting for commands that involve processing
  multiple BAM files, so that the name of the particular file causing
  the problem is included.

* wrapper: Fixed the Java version number check so that it works
  correctly with OpenJDK 1.8.

RTG Core 3.5 (2015-07-16)
-------------------------

Major features of this release:

* Several improvements to somatic calling, including the ability to
  specify site-specific somatic priors, control of output for
  gain-of-reference and loss-of-heterozygosity events, and changes to
  the VCF according to TCGA VCF specification:
  https://wiki.nci.nih.gov/display/TCGA/TCGA+Variant+Call+Format+%28VCF%29+1.2+Specification+-+Unofficial
  Note that these changes in VCF format compared to previous versions
  may require users to update their existing scripts for the changes.

* Improvements to variant evaluation with vcfeval, primarily the
  ability to perform evaluation restricted to individual regions or
  sets of regions (for example GIAB high-confidence intervals or exome
  target regions), as well as the inclusion of more accuracy metrics,
  both as a new summary file and included in the weighted ROC data
  file.

* Improvements to metagenomic species reference database
  management. Several new options allow better customization of a
  species reference, and extraction of genomic information for
  individual species contained within the reference database.

Detailed changes are listed below by area.  Please read these through
fully, as some command-line flags have changed, so updates to your
pipeline scripts may be required. For more information on new
features, see the RTG Operations Manual.

### Basic Formatting and Mapping

* format/map: When formatting or mapping reads supplied as SAM/BAM
  input data, any alignments marked as supplementary are ignored.
  Note that if the input data has already been aligned, it is
  recommended that the BAM file be shuffled to avoid biases during
  mapping arising from the data being presented in chromosomal
  order. See the user manual for more information.

* sdf2fasta/sdf2fastq: These commands have new flags --names and
  --id-file that operate the same as their counterpart in sdfsubset.

* sdfsubset: This command has new flags --start-id and --end-id that
  allow specifying a range of sequences by ID.

* sdf2sam: This new command allows the extraction of reads from SDF
  in the form of unaligned SAM/BAM.  This has a benefit over
  extraction as FASTQ in that some metadata (such as read group
  information) is preserved, paired end data is stored in a single
  file, and quality encoding is inherent in the format.

* chrstats: Reduce false positives in sex inconsistency detection that
  were due to applying the (tighter) sex-chromosome threshold also to
  autosomes. This threshold is now applied to sex-chromosomes only.

### Variant Calling and Analysis

* somatic: Now allows the user to specify a BED file containing
  per-site somatic priors, which can be used (for example) to reduce
  the somatic prior at sites typical of false positives (e.g. presence
  in dbSNP) or increase the somatic prior at sites known to harbour
  somatic variants (e.g. presence in COSMIC).  For more information
  see the user manual.

* somatic: At the end of variant calling, the somatic caller produces
  an estimate of somatic sample contamination.  Previously this
  estimate was only available in the log file, but in this release
  this computation has been greatly improved, and the contamination
  estimate is now included in the standard summary statistics.

* somatic: "Gain of reference" calls are now disabled by default.
  These can be included by specifying the new flag
  --include-gain-of-reference.

* somatic: Calls that are indicative of loss of heterozygosity (LOH)
  are not produced by default (since loss of heterozygosity
  analysis is most useful in conjunction with additional data such as
  germline variant calls or CNV data).  These calls can be produced if
  desired by specifying --loh with a prior greater than 0.

* somatic: When LOH calls are enabled, previously they were output in
  haploid GT representation, now they use the ploidy appropriate for
  the chromosome (according to the reference), for compatibility with
  downstream processing tools.

* somatic: VCF output changes to bring the somatic representation in
  line with TCGA 1.2 VCF specification. In particular:

  * Calls include a new FORMAT field SS that indicates the somatic
    status for the derived (tumor) sample. This field replaces the
    previous SOMATIC INFO field.

  * Calls include a new FORMAT field SSC which contains the somatic
    score for the derived (tumor) sample. This field replaces the
    previous RSS INFO field.

* lineage: Supports the input of pedigree in the form of VCF header
  annotations as output by the somatic caller, in the form:

  ##PEDIGREE=<Derived=TUMORSAMPLENAME,Original=NORMALSAMPLENAME>

* population: Fixed a rare case where sometimes after complex call
  simplification, the only sample genotype containing a non-ref allele
  was a member of the pedigree not being output, and in this case the
  QUAL score was the 10log10 prob(no variant) rather than 10log10
  prob(variant) as required by the VCF specification. This has been
  addressed.

* vcfmerge: Added a new flag --force-merge-all to always attempt to
  merge headers containing conflicting descriptions.

* vcfmerge: Previously vcfmerge would not process records containing
  symbolic alleles. These are now accepted.

* vcfmerge: More graceful handling when encountering records with a GT
  that refers to a non-existent ALT.

* vcfeval: Now outputs a summary containing various accuracy
  metrics. A first set of statistics is computed from the full set of
  variants evaluated (these will typically have highest sensitivity
  but potentially poor precision if the input call set has not been
  filtered). A second set of statistics is computed based on the ROC
  curve information, selected at a threshold which maximises the
  F-measure statistic (this provides some balance between sensitivity
  and precision, so may be a fairer point to gather statistics for
  cross-caller comparison).

* vcfeval: The weighted_roc.tsv file now includes columns containing
  additional accuracy metrics.

* vcfeval: Improved the detection that alerts the user when chromosome
  names are incompatible between reference, baseline, calls, and bed
  regions (if used). Improvements to other error and warning messages.

* vcfeval: Added a new flag --bed-regions to supply a BED file
  containing a list of regions that the VCF records must overlap with
  in order to be included in analysis.  For example, a common use case
  is to restrict to only evaluating calls contained within the GIAB
  high-confidence regions, or only within regions corresponding to
  exome target regions.

* vcfeval: Added a new flag --region to specify a single region to
  evaluate variants within. This is useful when evaluating calls on a
  single chromosome or within a small region of interest.

* vcfeval: Fixed a case where a ref-only call (i.e. containing no
  alts) could get output instead of an indel with a padding base at
  the same position.

* vcfeval: Disabled the output of slope analysis data files by default,
  as these are fairly special purpose (primary ROC files are still
  output). They can be re-enabled if desired by using the new
  expert/experimental flag --Xslope-files.

* vcffilter: The --remove-all-same-as-ref flag now does not consider a
  sample with missing GT as being variant, since the intent of this
  flag is to retain only records where at least one sample is called
  as variant.

* vcfannotate: Added two new flags --info-id and --info-description to
  allow specifying the name of the INFO ID and Description fields
  added to the header during annotation. These flags only take effect
  if the VCF header does not already contain an INFO declaration with
  that ID.
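
The threshold selection described for the vcfeval summary above
(choosing the ROC point that maximises the F-measure) can be sketched as
follows.  This is a minimal illustration using the standard definitions
of precision, sensitivity and F-measure; it does not read
weighted_roc.tsv, and the row layout (threshold, TP, FP, FN) and function
names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def f_measure(tp: float, fp: float, fn: float) -> float:
    """Harmonic mean of precision and sensitivity; 0.0 when either is undefined."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    sensitivity = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


def best_threshold(
    rows: Iterable[Tuple[float, float, float, float]]
) -> Optional[Tuple[float, float]]:
    """Given (threshold, TP, FP, FN) rows, return (threshold, F-measure) with the highest F-measure.

    Returns None for empty input; on ties the first row seen wins.
    """
    best = None
    for threshold, tp, fp, fn in rows:
        score = f_measure(tp, fp, fn)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best
```

Counts are typed as floats because weighted ROC data may contain
fractional true positive values.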

### Metagenomics

* taxfilter: Added a new flag --subtree which allows selecting entire
  taxonomic subtrees for inclusion in the output taxonomy.

* taxfilter: Added a new flag --remove-sequences to allow the removal
  of sequence data associated with specific taxon ids.

* sdf2fasta: Added a new flag --taxons to allow interpreting any
  supplied ID as a taxon ID and all sequences assigned to such taxon
  ID will be output. This provides an easy way to extract genomic
  sequence for any species from the reference SDF.
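
Selecting an entire taxonomic subtree, as the taxfilter --subtree flag
above does, amounts to collecting a taxon and all of its descendants.
The sketch below shows that traversal on a toy taxonomy.  It assumes the
taxonomy is available as a child-to-parent mapping of integer taxon IDs,
which is a hypothetical in-memory representation and not the RTG
taxonomy file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def subtree_ids(parent_of: Dict[int, int], roots: Iterable[int]) -> Set[int]:
    """Return the given taxon IDs plus all of their descendants."""
    # Invert the child -> parent mapping so children can be looked up by parent.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    selected: Set[int] = set()
    stack = list(roots)
    while stack:
        taxon = stack.pop()
        if taxon in selected:
            continue  # already visited (also guards against malformed cycles)
        selected.add(taxon)
        stack.extend(children[taxon])
    return selected
```

The same set of IDs could then drive sequence extraction, in the spirit
of the sdf2fasta --taxons flag described above.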

### Other

* genomesim: Added a new flag --prefix to specify a prefix for
  generated sequence names.

* many: Update the base library used for SAM/BAM input and output to
  htsjdk 1.128.

* many: VCF reading now detects cases where a header specifies a field
  declaration using an ID that is already in use, preventing duplicate
  header declarations.

* extract: Fix a regression where extracting from VCF without any
  region specified would include the VCF header.
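
The duplicate header declaration check mentioned above can be
approximated outside RTG with a few lines of Python.  This is a rough
sketch and not RTG's implementation: it only looks at INFO, FORMAT and
FILTER lines, and it assumes the ID key comes first inside the angle
brackets, which is conventional but not guaranteed.  The function name
is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g. ##INFO=<ID=DP,Number=1,...> and captures the field type and ID.
_DECLARATION = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def duplicate_declarations(header_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (field type, ID) pairs declared more than once, in order of first repeat."""
    seen = set()
    duplicates = []
    for line in header_lines:
        match = _DECLARATION.match(line)
        if not match:
            continue
        key = (match.group(1), match.group(2))
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

Note that the same ID may legitimately appear once as INFO and once as
FORMAT, so the field type is part of the key.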

RTG Core 3.4.5 (2015-05-22)
---------------------------

This release primarily includes bugfixes and minor improvements:

* somatic: If the input mappings contained unmapped records with
  assigned coordinates, these were erroneously being included as
  evidence, resulting in spurious calls when calling with non-zero
  contamination specified.

* vcfeval: Implemented an algorithm optimization that permits the
  evaluation of situations that previous versions would skip over as
  being too-complex (primarily where there were long runs of abutting
  variants), as well as yielding a general speed improvement.

* avrbuild: Add checks that the user has specified at least one VCF
  annotation for use as a predictor attribute.

* vcffilter: Fix bug when filtering on the FILTER declared last in the
  header for files that contained inadvertent duplicate FILTER header
  declarations or containing an explicit declaration for the PASS
  filter.

* rocplot: Minor improvements to file chooser handling, and also
  include F-measure as an additional accuracy statistic in the status
  bar.

RTG Core 3.4.4 (2015-04-20)
---------------------------

This release primarily includes bugfixes and minor improvements:

* vcffilter: The --keep-filter and --remove-filter options now
  recognize '.' as a value that can be filtered on.  For example, to
  keep only variants that have a FILTER column that corresponds to
  non-filtered, use -k . -k PASS.

* vcfeval: Enabled skipping over more extremely complex edge cases
  that could otherwise cause exceedingly long computation times.

* rocplot: Add the ability to click on a point within the graph to
  show in the status bar the true positives / false positives /
  precision / sensitivity scores equivalent to that point.

* rocplot: The individual curve sliders that can be used to simulate
  the effects of various threshold cut-offs did not work very well for
  curves corresponding to scores with very wide ranges and non-uniform
  distribution (such as GQ and QUAL often are). These sliders are
  improved so they work better with these curves, and adjusting the
  sliders also displays accuracy metrics in the status bar, to aid in
  threshold selection.

* rocplot: It was sometimes possible to zoom in to negative
  coordinates.

* aview: Fix display of BED regions that do not have a region name
  contained within the BED file.
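
The effect of keeping only non-filtered variants (the `-k . -k PASS`
example for vcffilter above) can be sketched on plain VCF text.  This is
an illustration only, not RTG's implementation.  In particular, keeping
a record when any one of several semicolon-separated FILTER values
matches is an assumption of this sketch, and the function name is
hypothetical.

```python
from typing import Iterable, Iterator, Set


def keep_by_filter(vcf_lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield header lines unchanged, and data lines with at least one FILTER value in keep."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        filters = set(fields[6].split(";"))  # FILTER is the 7th VCF column
        if filters & keep:
            yield line
```

Calling `keep_by_filter(lines, {".", "PASS"})` mirrors the example above:
records with an unset FILTER or an explicit PASS are retained.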

RTG Core 3.4.3 (2015-03-19)
---------------------------

This is primarily a bugfix release:

* map: Fixed a crash that could occur when mapping without any sample
  sex specified but when using a reference genome containing
  chromosome sex information.

* somatic: Fixed a rare crash that could occur when calling across
  blocks of Ns when the only hypothesis presented by the reads was a
  deletion of sufficient length.

* vcfeval: Improved handling in situations where variants are so dense
  within a region that there are too many possible haplotypes to
  feasibly resolve. Previously operation would abort, now a warning is
  issued and both baseline and called variants within that region are
  ignored.

RTG Core 3.4.2 (2015-03-02)
---------------------------

This is primarily a bugfix release:

* somatic: Fix a crash that could occur when calling across Ns in the
  reference.

* snp/family/population/somatic: Under some circumstances, I/O
  exceptions could trigger a crash talkback rather than being
  presented as a regular user-level error message for the user to act
  on.

* chrstats: Fixed inconsistent output destinations between
  single-sample vs multiple sample case, and do not create a log file
  for this command in the current directory.

* chrstats: Detect when the user has not set up a reference SDF with
  chromosome specification information and provide an appropriate
  error message indicating how to correct the situation.

* many: Improved error handling when requesting indexed region
  retrieval of BED/VCF/SAM files for coordinates outside the range
  that can be addressed by tabix/bam indexes.

* many: Improved error handling when errors are encountered during VCF
  header parsing, providing more information on where the problem was.

* many: Improved error handling when errors are encountered during
  tabix indexing.

RTG Core 3.4.1 (2015-01-22)
---------------------------

This is primarily a bugfix release:

* snp/family/population: Fixed a crash that could occur when calling
  across blocks of Ns when the only hypothesis presented by the reads
  was a deletion of sufficient length.

* snp/family/population: When calling across blocks of Ns, under some
  circumstances no variant call would be made.

* snp/family/population: Extremely large GQ and DNP FORMAT values are
  now capped at the maximum permitted by BCF (2147483647). Previously,
  values above this could occasionally trigger a crash.

* wrapper: Changes to streamline the first run configuration and to
  bring Unix and Windows wrappers closer to equivalence, including
  clearer instructions of how to customize initial
  configuration. Crash reporting is now opt-out rather than opt-in.

* unix wrapper: When the operating system fails to allocate memory to
  the JVM (typically due to other memory-intensive processes running
  on the same machine) this is now presented as a user message, rather
  than triggering a crash report talkback.

* many: input list files are now validated during loading rather than
  after loading the list. This gives much better error handling in the
  case where a user accidentally gives the name of an alignment file
  as an input list file.

* Other improvements and cleanups to documentation.

RTG Core 3.4 (2014-12-20)
-------------------------

Major features of this release:

* Added the ability to run variant calling only on a list of regions
  provided via BED file.  This results in a large speed improvement
  when performing exome variant calling, by avoiding computation
  associated with off-target locations, as well as permitting fast
  variant calling of target sites from whole genome data, or running
  variant calling in haploid mode in areas of loss-of-heterozygosity.

* Added the ability to perform variant calling for sites where the
  reference is unknown but where reads have been mapped. This can be
  used to fill in gaps in draft reference assemblies.  This includes
  both sites where an N is observed in the reference, larger N-blocks
  where reads have been mapped spanning the N block, and large
  N-blocks where reads are anchored on one side by known reference.

* Workflow improvements to human pipeline processing to identify
  mislabelled samples or incorrect pedigree.  At the end of read
  mapping, average coverage levels across chromosomes are examined and
  a warning is issued if there appear to be gross chromosomal
  abnormalities or if the coverage levels do not match expected levels
  for the sex of the individual specified. A standalone tool for this
  is also provided.  Similarly, the mendelian analysis tool now
  computes concordance with pedigree and issues a warning if low
  concordance indicates a parent or child is inconsistent with the
  supplied pedigree.  In addition we have added two commands for
  manipulating, extracting information from, and summarizing pedigree
  files.

* New commands for metagenomics taxonomy and reference database
  management.  Previously using metagenomics databases other than those
  pre-built by RTG was difficult and error-prone.  Three commands have
  been added to allow taxonomy construction starting from a NCBI
  taxonomy dump, filtering the taxonomy based on user criteria, and
  validating the structure of a metagenomics species reference
  database.


Detailed changes are listed below by area.  Please read these through
fully, as some command-line flags have changed, so updates to your
pipeline scripts may be required. For more information on new
features, see the RTG Operations Manual.


### Basic Formatting and Mapping

* map/cgmap/mapf: As an alternative to supplying --sex to specify the
  sex of the individual being mapped, you may specify a pedigree file
  containing the sex information for the sample.  This requires you to
  have either formatted the read set with read-group information or to
  supply read group information at mapping time (the advantage of this
  feature is that it lets you minimize the number of command-line
  differences for each sample being mapped).

* map/cgmap: When mapping using a reference containing sex chromosome
  information, average per-chromosome coverage information is used to
  issue warnings when it is likely that the incorrect mapping sex has
  been specified or if any autosomes have abnormal coverage levels
  (perhaps indicating a chromosomal abnormality).  This feature
  requires you to be using a reference genome SDF containing chromosome
  information, as described in the RTG Operations Manual.

* chrstats: New command to perform standalone average coverage
  reporting and checking against expected coverage levels from
  calibrated mapping files.  This is essentially the same check that is
  performed during mapping, but allows multiple mapping files to be
  provided (either if multiple mapping runs were performed for a
  single sample, or for batch reporting for multiple samples).

* calibrate: New option --merge to allow merging multiple alignment
  files into a single output file while performing calibration.  For
  example, this can reduce the number of I/O operations needed to go
  from multiple, uncalibrated, unindexed third party input files to a
  single calibrated indexed BAM file.

* calibrate: New option --threads to allow calibration of multiple files to
  use multiple cores.  (Currently this option only takes effect when
  used with the --merge option, not regular multi-file calibration.)


### Variant Calling

* snp/family/population/somatic: New flag --bed-regions, adds the
  ability to only perform calling on the regions specified via a BED
  file.  This is more efficient than applying BED filtering via
  --filter-bed.  However note that the results can sometimes differ,
  due to edge effects of complex calling regions that cross region
  boundaries.

* snp/family/population/somatic: Implemented variant calling across
  N's in the reference.  (This was previously occurring in some cases
  where mappings across the N contain indels, but has now been fully
  implemented).  Calls where the reference is not a valid allele due to
  containing an N are annotated with an NREF INFO tag for easy
  filtering, and contain neither QUAL nor GL values.

* snp: As an alternative to supplying --sex to specify the sex of the
  individual for variant calling, you may specify a pedigree file
  containing the sex information for the sample.  This can reduce the
  number of command-line differences when processing multiple samples.

* family/population/somatic: Better error handling when input mappings
  contain a record that does not correspond to one of the samples
  being called.

* snp/family/population/somatic: Fixed a hang that could occur when
  trying to clean up after an out-of-memory error.

* snp/family/population/somatic: Fixed a rare crash that could occur
  at the end of chromosomes.

* somatic: Previously the somatic caller stored a somatic score
  indicating the likelihood of the variant being a somatic variant in
  the QUAL field.  This is not strictly according to the VCF spec, so
  this score has been moved to the new NCS INFO field.

* vcfannotate: The --fill-ac-an flag now does not add an AC annotation
  when no ALTs are present in a record.

* vcffilter: New flag --region to extract and filter only the variants
  contained within a single specified region.

* vcffilter: New flag --bed-regions to extract and filter only
  variants contained within the regions contained in a BED file.

* vcffilter: Better error handling when applying criteria that require
  GT be present to files that are missing the GT field.

* vcfmerge: The default behaviour has changed when merging variants at
  the same position where the ALTs are different and the variants
  contain FORMAT fields that cannot automatically be merged
  (Number=A,G,R, or the special case of the AD FORMAT field).  Now
  these FORMAT fields are removed to allow the merge to proceed.  There
  is a new flag --preserve-formats to instead output separate variants
  that keep those FORMAT fields.

* vcfeval: New flag --baseline-tp that allows additionally outputting
  the baseline version of true positive variants (the regular tp.vcf
  contains the called representation of true positive variants).

* vcfeval: --squash-ploidy treats heterozygous calls in baseline and
  calls as homozygous ALT to allow a lenient comparison.  Note that
  genotypes at multi-allelic sites where neither allele is REF simply
  choose the ALT with the highest index.

* vcfeval: Fixed an exception that could occur when processing variants
  missing GT information for some samples.

* vcfeval: Fixed an exception that could occur when provided variants
  that were outside the bounds of the supplied reference genome.

* vcfeval: Fixed an inconsistency when handling ROC files in locales
  where ',' is the decimal separator.

* mendelian: The default is now to perform checks only on non-failing
  variants. The --pass flag has been removed, and a new flag added
  --all-records in order to obtain the behaviour of checking all
  variant records regardless of filters.

* mendelian: Now performs concordance checking to detect sample
  mislabelling and incorrect pedigree.

* mendelian: Removed --male and --female flag, which were only needed
  for VCFs produced by versions of RTG prior to 2.7.  If required,
  alternative pedigree information can be supplied via the --pedigree
  flag.
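
The --squash-ploidy rule described above for vcfeval can be illustrated
with a minimal Python sketch. This is not the vcfeval implementation;
the function name `squash_ploidy` is hypothetical, and the sketch only
restates the rule as worded in the note (any genotype carrying an ALT is
reduced to a single ALT allele, and where several ALTs are present the
highest index is chosen).

```python
import re


def squash_ploidy(gt):
    """Return the single ALT allele index a genotype squashes to, or None.

    None means the genotype carries no ALT allele (all REF or missing).
    """
    # Split on either the unphased ('/') or phased ('|') separator.
    alleles = [int(a) for a in re.split(r"[/|]", gt) if a != "."]
    alts = [a for a in alleles if a > 0]
    # Assumption from the release note: the highest ALT index wins.
    return max(alts) if alts else None
```

Under these assumptions, "0/1" and "1/1" both squash to ALT 1, so a
heterozygous call and a homozygous call at the same site compare as equal.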


### Metagenomics

* ncbi2tax: New tool to generate an RTG taxonomy file from NCBI
  taxonomy dump.

* taxfilter: New tool for the custom filtering of taxonomy files and
  metagenomic reference SDFs containing taxonomy information.

* taxstats: New tool for verifying the contents of a metagenomic
  reference SDF.


### Other

* sdfsubseq: The output sequence name is the same as the input
  sequence if the coordinates are unchanged.

* many: Added the ability to read BED from stdin by specifying '-' as
  the BED file name (this is not supported in cases where a region
  restriction is also being applied to the file, as this would require
  the BED to be tabix indexed)

* many: Added the ability to read VCF from stdin by specifying '-' as
  the VCF file name (not supported in cases where a region restriction
  is also being applied to the file, as this would require the VCF to
  be tabix indexed)

* many: Users of Linux bash can enable command and flag
  completion. See the file rtg-bash-completion in the scripts
  directory for more information.

* bgzip: New flag --no-terminate allows omission of the block gzip
  termination block. This permits advanced users to compress multiple
  files for later fast concatenation (the termination block should be
  present on the final file only).

* bgzip: New flag --compression-level allows altering the degree of
  compression (thus speed) from 1 (least but fast) to 9 (best but
  slow).

* rocplot: GUI mode has better error handling when there is no
  graphical environment.

* rocplot: PNG output mode will attempt to use headless mode to
  prevent an error when the graphical environment is unavailable.

* popsim: Speed improvements.

* readsim/cgsim: Added the --sam-rg flag to set the read group
  information to be stored in the output SDF. Removed --diploid-input
  as the recommended way to simulate diploid genomes is to use
  samplereplay or the --output-sdf option of
  samplesim/childsim/denovosim.

* readsimeval: New command for evaluating the accuracy of mapping reads
  generated by readsim.

* pedfilter: New command for pedigree file filtering and simple
  manipulation and conversion between pedigree PED files and
  pedigree-augmented VCF headers.

* pedstats: New command for extracting and summarizing information
  contained in a pedigree file.

* aview: The flag --dont-display-dots has been renamed to
  --no-dots for consistency.
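
To make the bgzip --no-terminate note above more concrete, the following
Python sketch works at the byte level with the 28-byte empty BGZF block
that the SAM specification defines as the end-of-file marker. The helper
names `has_bgzf_terminator` and `concatenate` are hypothetical and are
not part of RTG; the sketch only shows why parts written without a
terminator can be joined directly, with a single terminator at the end.

```python
# The empty BGZF block used as an end-of-file marker (SAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_terminator(data):
    """Return True if the byte string ends with the BGZF EOF block."""
    return data.endswith(BGZF_EOF)


def concatenate(parts):
    """Join block-compressed parts, keeping a terminator only at the end."""
    out = bytearray()
    for part in parts:
        # Drop any terminator found on an individual part.
        if has_bgzf_terminator(part):
            part = part[: -len(BGZF_EOF)]
        out += part
    out += BGZF_EOF
    return bytes(out)
```

Parts produced with --no-terminate would need no stripping at all, which
is what makes the later concatenation fast.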

RTG Core 3.3.2 (2014-04-09)
---------------------------

This is a bugfix only release:

* Fix soft-clipping behaviour when using the table-based single-indel
  aligner.

RTG Core 3.3.1 (2013-12-06)
---------------------------

This is a bugfix only release:

* During variant calling with pedigrees, particularly complex
  situations were deferring to a new algorithm, but this had
  undesirable performance characteristics on very large
  pedigrees. This has been reverted until the performance can be
  improved.

RTG Core 3.3 (2013-11-29)
-------------------------

Major features of this release:

* Speed improvements to family calling and population calling,
  particularly with large numbers of samples.

* Speed improvements to mapping as a result of a new table aligner
  (enabled for Illumina data by default).

* Mapping and variant calling have been improved to allow variant
  calling out to 50bp indels by default, in comparison to previous
  releases that defaulted to 9bp.  When using the new table aligner,
  there is a net improvement in mapping speed.  With the general
  aligner, mapping speed is impacted, but a full trade-off can be
  achieved via the aligner band width flag (see below).

* Pipeline streamlining.  SAM read group information can now be stored
  within an SDF at formatting time, and this information will
  automatically be used by subsequent mapping commands.  This has
  necessitated an increase in the SDF version, so old versions of RTG
  will not be able to read SDFs created by this version.  When variant
  calling exome datasets, the target region bed file can be supplied
  to automatically flag variants off target, saving an extra vcffilter
  step.

* Mapping and variant calling are now PAR aware.  If your reference SDF
  contains information about PAR regions (as described in the user
  manual), reads will be mapped to only one instance of each PAR region,
  and variant calling will automatically switch between haploid and
  diploid calling as appropriate.

Detailed changes are listed below by area.  Please read these through
fully, as some command-line flags have changed, so updates to your
pipeline scripts may be required.


### Basic Formatting and Mapping

* cg2sdf: New flag --sam-rg allows the specification of a SAM read
  group to be stored in the resulting SDF.  Note that this means the
  SDF version has changed, so SDFs produced by this version of RTG
  will not be readable with earlier versions of RTG.

* format: New flag --sam-rg allows the specification of a SAM read
  group to be stored in the resulting SDF.  Note that this means the
  SDF version has changed, so SDFs produced by this version of RTG
  will not be readable with earlier versions of RTG.

* format: When formatting reads from BAM files, the read group
  information is automatically stored in the resulting SDF.  Only a
  single read group is permitted per SDF, so if the input contains
  multiple read groups you must either use the new flag
  --select-read-group to select only records belonging to the
  specified read group, or use --sam-rg to explicitly define a single
  read group that all records will be assigned to.

* format/map: Various improvements to handling of input reads stored
  in BAM files.  When reading input from BAM, records that have the
  "secondary alignment" SAM flag set are ignored (on the assumption
  that every read should have a single primary record).  Warnings will
  be produced if the same read has multiple primary alignments or if
  paired-end data does not have matching records for each read-end,
  along with a summary after formatting indicating how many cases were
  encountered.

* map: When mapping from SDF that contains a read group there is no
  need to specify --sam-rg, as it is picked up automatically.

* map/mapf: Reduced memory usage during mapping when mapping reads
  from SDF along with the --read-names option.

* map/mapf: Added new flag --unknowns-penalty to allow more explicit
  control over how Ns are scored during alignment. The default value
  is 5 (in comparison, the default mismatch penalty is 9).

* map/mapf: Removed --penalize-unknowns as this is now redundant due
  to --unknowns-penalty. If desired, equivalent behaviour can be
  obtained by supplying --unknowns-penalty with the same penalty as
  the mismatch penalty, or --unknowns-penalty=0 to not penalize
  unknowns. Note that regardless of the penalty, alignment CIGARs
  always indicate Ns as a mismatch.

* map/mapf: These commands have the ability to use a new aligner for
  faster alignment and better identification of longer indels.
  Setting the --aligner-mode=table explicitly enables the use of this
  aligner, and setting --aligner-mode=general explicitly uses the same
  aligner as previous versions of RTG.

* map: When mapping Illumina data (as determined by the PLATFORM field
  of the SAM read group supplied), and the --aligner-mode is set to
  its default value of "auto", the new table aligner is employed.

* map/mapf: The mechanism for setting aligner band width has changed.
  The flag --aligner-band-width-factor has been replaced by the new
  flag --aligner-band-width which takes the length of indel that can
  reliably be detected as a fraction of the read-length. The new
  default is 0.5, so for 100bp reads the aligners will attempt to find
  50bp long insertions/deletions (there may be some cases where longer
  events are found). Increasing this factor will increase runtime, and
  decreasing this will reduce runtime. Roughly comparable behaviour
  and speed to the previous release can be obtained with
  --aligner-band-width=0.1. When changing the --aligner-band-width, it
  often makes sense to also adjust alignment score thresholds.

* map/cgmap: When performing sex-aware mapping, reads will only be
  mapped to one occurrence of PAR regions (that on the X
  chromosome). This requires that your reference SDF reference.txt
  contains specifications of the PAR regions (see the User Manual for
  the description of the reference.txt file).

* sam2bam: Specifying '-' as the output name will send output to
  standard out.

* sammerge: Specifying '-' as the output name will send output to
  standard out.

* sammerge: New flags --require-flags/--filter-flags that allow
  accepting or rejecting SAM records based on the settings of the
  FLAGS column of each record. For example, to reject all records
  marked as secondary alignments, use --filter-flags 256.

* sammerge: New flag --exclude-unplaced to filter out any alignment
  records that do not have an alignment position.

* samstats: Removed --penalize-unknowns, as this tool could not handle
  the variable penalties for Ns that have been introduced to mapping.
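
The sammerge --require-flags/--filter-flags note above describes
bit-mask tests against the SAM FLAG column. The Python sketch below
shows one common interpretation of such masks; it is an assumption, not
RTG's code, that every bit in the require mask must be set and that no
bit in the filter mask may be set. The function name `keep_record` is
hypothetical.

```python
SECONDARY = 0x100  # 256: the "secondary alignment" bit of the SAM FLAG


def keep_record(flag, require_flags=0, filter_flags=0):
    """Return True if a record with this FLAG value passes both masks."""
    # Assumed semantics: all required bits present, no filtered bit present.
    has_required = (flag & require_flags) == require_flags
    has_filtered = (flag & filter_flags) != 0
    return has_required and not has_filtered
```

With `filter_flags=256`, a record with FLAG 355 (which has the secondary
bit set) is rejected, while a record with FLAG 99 is kept.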


### Variant Calling

* all variant callers: Calls now include the GL field for increased
  compatibility with downstream tools such as Beagle and PolyMutt.

* all variant callers: New VCF FORMAT fields are included to aid AVR
  scoring: PPB detects whether an imbalance of properly paired reads
  is present, and PUR measures the ratio of placed unmapped reads
  (those where the mate has been uniquely mapped but the read itself
  was not mapped).

* all variant callers: These callers now perform PAR aware variant
  calling by default (Males will have PARs on X called as diploid, and
  on Y not called). There can be some edge effects up to a read length
  either side of a PAR boundary when complex variant calls are made
  spanning the boundary.

* all variant callers: For some complex calls the AR FORMAT annotation
  value would sometimes exceed the maximum value of 1.

* all variant callers: Calibration based calculation of average
  coverage (used for AVR predictor attributes and overcoverage level
  setting) is corrected for Ns in the reference.

* all variant callers: New --filter-bed flag for exome region filtering.

* all variant callers: Chromosomes such as MT that are denoted in the
  reference.txt as "polyploid" are treated as haploid during variant
  calling. In previous releases only the snp command would call these
  chromosomes (the family and population commands would skip calling
  on these chromosomes). In this release the family and population
  commands call these chromosomes as haploid and when pedigree is
  present, maternal inheritance is assumed.

* population: New flag --pedigree-connectivity to give explicit
  control over whether calling is carried out for highly-connected vs
  sparsely connected pedigrees.  The default is to automatically
  choose the mode. See the user manual for more information on when
  different modes might be appropriate.

* population: Improved memory usage and startup time when running very
  large population runs.

* vcffilter: The --density-window flag was not correctly handling
  indels at the same coordinate as a SNP.

* vcffilter: Reinstated flag --remove-all-same-as-ref (which was
  incorrectly removed previously).

* vcffilter: New flags --max-combined-read-depth and
  --min-combined-read-depth for filtering on the combined read depth
  in the INFO column.

* vcffilter: New flag --clear-failed-samples for clearing the GT of
  failed samples instead of removing the entire variant line.

* vcffilter/vcfannotate: Specifying '-' as the input name will read
  from standard in, and similarly specifying '-' as the output name
  will send output to standard out.

* vcfmerge: Specifying '-' as the output name will send output to
  standard out.

* vcfstats: Fixed an exception when running on VCF files containing
  records with no GT field.

* vcfeval: New beta module for evaluating variant call sets (formerly
  known as snpsimeval).

* vcfeval: Now issues warnings when there are differences between the
  set of sequences contained in the reference vs baseline variant set
  vs called variant sets.

* vcfeval: Previously, extremely complicated situations could consume
  vast amounts of memory and eventually crash.  These now exit
  gracefully with a message about which region caused the problem.  It
  is currently up to the user to manually filter out the problematic
  region and rerun.

* vcfeval: The --sort-value parameter has been removed in favor of
  using the --vcf-score-field flag.

* rocplot: This is a new module for plotting ROC graphs, which can run
  both as an interactive GUI or generate a static PNG image.
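
The PAR-aware calling behaviour noted above (males have PARs on X called
as diploid, and on Y not called) can be sketched as a small ploidy
lookup. This is an illustration only: the function name `ploidy`, the
chromosome names "X" and "Y", the sex labels, and the handling of
non-PAR and female cases are assumptions based on ordinary human sex
chromosome ploidy, not a description of RTG internals. Real PAR
coordinates come from the reference.txt file of the reference SDF.

```python
def ploidy(sex, chrom, pos, par):
    """Ploidy to call at a 1-based position; 0 means the site is not called.

    par maps a chromosome name to a list of inclusive (start, end) intervals.
    """
    in_par = any(start <= pos <= end for start, end in par.get(chrom, []))
    if chrom == "X":
        # Females are diploid on X; males are diploid only inside a PAR.
        return 2 if sex == "female" or in_par else 1
    if chrom == "Y":
        # Females have no Y; male PAR sites are called on X instead.
        if sex == "female" or in_par:
            return 0
        return 1
    return 2  # autosomes
```

For example, with placeholder intervals `{"X": [(1, 1000)], "Y": [(1, 1000)]}`,
a male at X:500 is called as diploid and at Y:500 is not called.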


### Other

* extract: New flag --header-only to output only the header for the
  file of interest.

* species: Includes genome length column (this column will only
  contain a value when the row corresponds directly to a species rather
  than an inner taxonomy node, i.e. when the value in the reference
  column = Y).

* readsim: Allows simulation of PCR duplicates and chimeric reads.

* aview: Now has the ability to load BED tracks (e.g. particularly
  useful to display variant caller regions.bed.gz, or an exome target
  regions BED file).

* aview: When filenames for VCF and BED tracks are specified in the
  form FILE=TITLE, the title given will be used in the track display.



Previous releases:

RTG Core 3.2.2 (2013-09-17)
---------------------------

This is a bugfix only release:

* many: Commands that involved reading from multiple SAM files could
  produce a crash in some circumstances when mixing single-end and
  paired-end records.

* vcfsubset: Sample presence checking was over-stringent, by checking
  for SAMPLE header lines as well as a named sample column on the
  CHROM line. SAMPLE header lines are now not required.

* coverage: Fixed an exception that would occur when running with a
  smoothing parameter larger than 5000.

RTG Core 3.2.1 (2013-08-01)
---------------------------

This is a bugfix only release:

* coverage/species: Fixed a crash when trying to generate graphs for
  reports when running on a machine without X11 available.

* many: I/O exceptions raised during asynchronous file reading could
  sometimes cause a talkback instead of a graceful user error message,
  or rarely would fail to detect the exception altogether.

* vcffilter: Even when not specified, AR filtering was removing values
  greater than 1 if other per-sample filtering was being carried out.
  (These AR values greater than 1 are very rare and will be addressed
  in a subsequent release.)

* vcfsubset: Up-front checking of field names supplied by the user
  (FILTER/INFO/FORMAT) instead of causing an exception during later
  processing.

* vcfsubset: When the stripping of specified FORMAT subfields would
  result in a variant having no format sub-fields remaining, rather
  than outputting an invalid record, the whole record is removed (and
  a count of such records is output upon completion).


RTG Core 3.2 (2013-06-20)
-------------------------

IMPORTANT NOTES WHEN UPGRADING:

* As requested by many customers, the default alignment output by map
  commands is now a single BAM file, named 'alignments.bam', to
  further simplify subsequent processing without the need for an
  explicit merge.  Any scripts you may have will need to be updated
  before working with this new version.  The old behaviour can be
  obtained using the flags --sam and/or --no-merge.  See the user
  manual for more information on these flags.

* The pre-built AVR models have been revised and rebuilt, and now
  include separate models for exome and WGS datasets.  As such, the
  models will have somewhat different characteristics compared to
  their analogs from 3.1.  Most notably the scores are likely to be
  spread over a wider range than before, so any cut off thresholds
  currently in use for filtering should be adjusted.  The three
  pre-built models are now 'illumina-exome.avr', 'illumina-wgs.avr',
  and 'alternate.avr'.

Other major features of this release:

* Population priors can now be supplied in a more compact form,
  allowing for much faster processing when using population priors
  derived from a large number of samples.

* Significant speed improvements to family calling and population
  calling (particularly pedigrees involving many families).

* Many improvements to avrbuild for customers building their own
  models.  These include greatly reduced memory requirements and
  improved speed.  The models are also now self balancing so
  discrepancies between the amount of positive and negative training
  data should have less impact on the spread of scores produced.

Detailed changes are listed below by area.  Please read these through
fully, as some command-line flags have changed, so updates to your
pipeline scripts may be required.


### Basic Mapping

* map/cgmap/mapf: The default outputs of these commands have been
  changed.  Rather than outputting separate block-compressed SAM files
  for each of mated / unmated / unmapped reads, these commands now
  output a single merged alignments BAM file.  The implementation of
  this incurs a much smaller overhead than performing a separate
  post-mapping merge.  Two flags have been added that can be used to
  obtain the old behaviour: --sam will output SAM rather than BAM,
  and --no-merge will cause separate output files to be produced.

### Variant Calling

* snp/family/population: The ambiguity ratio VCF annotation was not
  being output for complex variant calls.

* snp/family/population: The representation of indels in output VCF
  now only includes the previous base when absolutely necessary
  (previously the previous base was included for all indels, in
  accordance with an earlier version of the VCF specification).

* snp/family/population: VCF records now include an annotation
  containing the allelic depth (AD) for each sample.

* snp/family/population: Calling with population priors (via
  --population-priors) will now extract allele counts from the INFO
  AC/AN fields if these are present, and only fall back to counting from
  per-sample GT fields if they are not.  A population priors VCF that
  contains only these INFO fields and no sample columns can be
  significantly faster to process.  Such a
  reduced VCF can be constructed by passing the original population
  priors VCF through rtg vcfannotate --fill-an-ac, followed by rtg
  vcfsubset --keep-info AC --keep-info AN --remove-samples. See the
  user manual for more information.

* snp/family/population: The ABP/SBP annotations for complex called
  variants were not being calculated correctly in all cases.

* family/population: Much faster computation of pedigree posteriors
  (for some cases involving many families the entire variant calling
  run is 5x faster)

* population: Complex call splitting was sometimes not correctly
  representing any disagreeing hypotheses (DH attribute). This has now
  been fixed.

* vcffilter: The --sample flag can be specified multiple times to
  require that any sample-specific filtering criteria apply to more
  than one sample.

* vcffilter: New flag --all-samples can be used to apply any
  sample-specific criteria to every sample in the VCF.

* vcffilter: New flag --non-snps-only.

* vcffilter: Renamed flag --snp-density-window to --density-window,
  since it acts on all variants, not just SNPs.

* vcffilter: Removed flag --remove-all-same-as-ref, as this can now be
  achieved with --remove-same-as-ref --all-samples

* vcffilter: New flags --min-denovo-score and --max-denovo-score help
  identify high quality de novo variants. See the user manual for more
  information.

* vcfannotate: Restructuring of the flags accepted by vcfannotate. To
  annotate with variant IDs contained in a BED file, use --bed-ids
  BEDFILE. To annotate from a BED file into the INFO field of the
  output VCF, use --bed-info.  To annotate with variant IDs contained
  in a VCF, use --vcf-ids VCFFILE.

* vcfannotate: New option --fill-an-ac to recompute AN and AC INFO
  fields based on GT fields present in the VCF.

* vcfsubset: New command for performing column-wise removals from a
  VCF file, such as removing samples, format sub-fields, info fields,
  etc.
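
As an illustration of what recomputing AN and AC from GT fields involves
(the vcfannotate --fill-an-ac option above), here is a minimal Python
sketch using the standard VCF definitions: AN is the total number of
called alleles, and AC holds one count per ALT allele. The function name
`fill_an_ac` is hypothetical and the code is not taken from RTG.

```python
import re


def fill_an_ac(genotypes, num_alts):
    """Compute (AN, AC) from a list of per-sample GT strings.

    AN is the number of called alleles; AC has one count per ALT allele.
    """
    an = 0
    ac = [0] * num_alts
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing alleles count toward neither AN nor AC
            an += 1
            index = int(allele)
            if index > 0:
                ac[index - 1] += 1
    return an, ac
```

For the genotypes "0/1", "1|1", "./." and "0/2" at a site with two ALT
alleles, this gives AN=6 and AC=3,1.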


RTG Core 3.1.2 (2013-05-02)
---------------------------

Changes in this release:

* vcffilter: Fixed a bug when filtering multi-sample VCF files by
  --min-avr-score or --max-avr-score. The code was ignoring the
  --sample flag and always basing the filtering on the AVR score of
  the first sample.

* family: Added a new flag --pedigree to allow specifying the family
  in PED format instead of via --father/--mother/--son flags etc. The
  PED file must contain only a single nuclear family.


RTG Core 3.1.1 (2013-04-24)
---------------------------

This is a bugfix only release:

* map/cgmap: When using --bed-regions during mapping of exome data,
  information for sequences not referenced in the BED was not
  correctly initialized, leading to an error message in subsequent
  variant calling.

* avrbuild: A small improvement in the treatment of missing values
  during model building.

* population: Fix a problem with internal ids associated with
  families.


RTG Core 3.1 (2013-04-19)
-------------------------

Major features of this release:

* Improvements to alignments produced during mapping. The aligner
  penalties have changed to give better behaviour regarding indel
  detection. There are several additional user controls to give finer
  control over aligner behaviour.

* De novo variant calling in pedigree calling. Variants in offspring
  will automatically be marked in the output, and an additional score
  is produced that indicates the confidence that the variant is a true
  de novo variant.

* Adaptive Variant Rescoring (AVR) is a new capability that allows a
  scoring of variants that incorporates effects such as library prep
  or mapping artifacts that are not directly incorporated in the
  Bayesian variant modelling.  AVR incorporates machine learning
  models built from predictor attributes that empirically correlate
  with correctness with respect to a base variant set.  Several new
  attributes have been added to output VCFs to facilitate this.  This
  release includes some pre-built scoring models and provides tools to
  allow building models that may be better adapted to particular
  projects.

* Streamlined processing of exome datasets.  Mapping now directly
  supports the specification of an exome regions BED file to ensure
  variant calling includes appropriate calibration information for
  automatically determining over coverage situations.

Detailed changes are listed below by area.  Please read these through
fully, as some command-line flags have changed, so updates to your
pipeline scripts may be required.

### Basic Mapping

* format/map/mapf/mapx: Input FASTQ that is badly formed due to a
  mismatch between the sequence names given in the "@" vs the "+"
  sections of the record used to be silently skipped.  These records
  are now processed but warnings are issued that the FASTQ is badly
  formed.

* map: New flag --bed-regions that should be used during exome
  processing to ensure correct calibration information is available
  during variant calling. See the user manual for more information on
  exome processing workflows.

* map/mapf/cgmap: Several flags have their long names renamed to
  improve consistency with SAM specification terminology and for
  updated semantics in the presence of altered aligner penalties:

  --max-insert-size renamed to --max-fragment-size
  --min-insert-size renamed to --min-fragment-size
  --max-alignment-score renamed to --max-mismatches
  --max-mated-score renamed to --max-mated-mismatches 
  --max-unmated-score renamed to --max-unmated-mismatches

* map/mapf/cgmap: The various --max-*-mismatches flags have a slightly
  different interpretation with new aligner penalties. This now sets
  an alignment score threshold that would allow the specified number
  of mismatches.

* map/mapf: These commands have new flags that give more control over
  aligner penalties and indel detection capabilities:

  --mismatch-penalty. The penalty used when scoring a mismatch.
  --gap-open-penalty. The penalty used when scoring the opening of a
    gap.
  --gap-extend-penalty. The penalty used when scoring the extension of
    a gap.
  --soft-clip-distance. When using aligner penalties that favour
    detection of indels, often incorrect indels will be produced at
    the ends of reads. This flag specifies the distance from the ends
    of reads within which any indels will be soft-clipped.
  --aligner-band-width-factor.  Increasing this factor will allow
    longer indels to be aligned, at the expense of speed. A factor of
    1 gives room to find indels with length corresponding to
    --max-mismatches. A factor of 2 will double this length.

* map/mapf: The default aligner penalties have been updated to give
  better handling of indels in alignments.  To achieve scoring similar
  to previous versions, use --soft-clip-distance=0
  --mismatch-penalty=1 --gap-open-penalty=1 --gap-extend-penalty=1

* coverage: Speed improvement when processing references containing
  large numbers of sequences.

* coverage: Automatically outputs an HTML report that graphically
  shows aggregate coverage levels.

* coverage: Fixed handling of zero length reference sequences.
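
The relationship between the aligner penalty flags and the
--max-*-mismatches thresholds described in this section can be sketched
in Python as follows. This is a rough model under stated assumptions,
not RTG's aligner: it assumes an affine gap cost in which each gap pays
the open penalty once plus the extend penalty per gap base, and that the
score threshold is simply the number of permitted mismatches multiplied
by the mismatch penalty. The function names are hypothetical.

```python
def alignment_score(mismatches, gaps, mismatch_penalty,
                    gap_open_penalty, gap_extend_penalty):
    """Score an alignment; gaps is a list of gap lengths, one per indel."""
    score = mismatches * mismatch_penalty
    for length in gaps:
        # Assumed affine model: one open cost plus a per-base extend cost.
        score += gap_open_penalty + length * gap_extend_penalty
    return score


def score_threshold(max_mismatches, mismatch_penalty):
    """Score that would allow the given number of mismatches (assumed)."""
    return max_mismatches * mismatch_penalty
```

Under this model an alignment is accepted when its score does not exceed
the threshold, so raising the mismatch penalty relative to the gap
penalties makes indels comparatively cheaper.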


### Variant Calling

* family/population: These commands automatically perform detection of
  de novo variants in children. This evaluation is represented by two
  new FORMAT attributes in the VCF. DN is a binary indicator as to de
  novo status, which will appear for any child samples in which one of
  the siblings contains a putative de novo variant (with value 'Y' for
  the putative de novo), and DNP contains the de novo posterior score.

* family/population: These commands implement phasing by descent on a
  site-by-site basis.  Wherever it can be unambiguously determined
  which parent contributed each allele, the child GT is phased so that
  the paternal allele is first, maternal allele second.

* population: Fixed a rare corner case where overlapping calls could
  be output when multiple families were present in the pedigree.

* snp/family/population: These commands now no longer hard filter on
  ambiguity ratio by default, the motivation being that this
  functionality should be subsumed by AVR.

* snp/family/population: These commands automatically output a simple
  HTML report containing summary information on variant calls.

* snp/family/population: There have been many changes to the
  annotations produced by default, primarily to provide predictor
  attributes for use with AVR.  See the user manual for the full
  descriptions of these new annotations.

* avrbuild: New command to build a machine learning model of variant
  accuracy from annotated training data. See the user manual for more
  details on AVR model building.

* avrpredict: New command to score variants using a machine learning
  model created by avrbuild.

* snp/family/population: These commands have a new flag --avr-model to
  supply a machine learning model that will be used to score variants
  during calling.  This release includes two pre-built models, and
  users may build their own using avrbuild.

* avrstats: New command to display simple information about an AVR model.

* mendelian: New flag --output to output all input variants (with
  mendelian violation information added as annotations) to a single
  output file.

* snpfilter: This previously beta command is now renamed to vcffilter.

* snpannotate: Renamed this beta command to vcfannotate.

* vcffilter: New flags --include-vcf and --exclude-vcf to include or
  exclude variants that overlap with those contained in the specified
  VCF file. Removed --allele-balance-variation flag (the allele
  balance annotation it operated on has been removed, superseded by a
  similar annotation capturing allele balance information).

* vcffilter: New flags --max-avr-score and --min-avr-score to allow
  filtering on the AVR annotations.

* vcffilter: New flag --remove-overlapping to discard any variants
  that overlap with previous variants.

* vcfmerge: When merging records at identical locations that contained
  different ALT alleles, any FORMAT attributes that used VCF number
  entries "A" or "G" (e.g. GL, PL, GP fields produced by other variant
  callers) would become invalid. vcfmerge will now leave those records
  unmerged in the resulting VCF if any A or G number attributes are
  present.

* snpsimeval: New beta module for performing very detailed concordance
  analysis of two variant call sets, capable of resolving differences
  in representation of equivalent calls.
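
The phasing-by-descent rule noted above for family/population (paternal
allele first, maternal allele second, only where the origin of each
allele is unambiguous) can be sketched per site in Python. This is a
simplified illustration for diploid trios, not the RTG implementation;
the function name `phase_child` is hypothetical.

```python
def phase_child(father, mother, child):
    """Return the child's GT as 'paternal|maternal', or None if unresolved.

    Each argument is a pair of allele indices, e.g. (0, 1).
    """
    a, b = child
    candidates = set()
    # Try both assignments of the child's alleles to the two parents.
    if a in father and b in mother:
        candidates.add((a, b))
    if b in father and a in mother:
        candidates.add((b, a))
    if len(candidates) != 1:
        return None  # ambiguous, or inconsistent with Mendelian inheritance
    paternal, maternal = candidates.pop()
    return f"{paternal}|{maternal}"
```

For a 0/0 father, 0/1 mother and 0/1 child this returns "0|1"; when both
parents and the child are 0/1 the origin cannot be determined and the
function returns None.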


### Other

* many: Fixed confusing reporting of out-of-disk space errors in
  multi-threaded situations.

* aview: New beta module for creating visual pileups in a terminal or
  HTML form.


RTG Core 3.0 (2013-02-15)
-------------------------

Below are the major changes in this release. Please read these through
fully, as some command-line flags and output file names have changed,
so updates to your pipeline scripts may be required.

### Metagenomics

* mapf: Added the flag --sam-rg. This allows you to specify the
  platform from which the reads originate (in particular, when the
  platform is IONTORRENT slightly different alignment parameters are
  used).

* similarity: This command now computes a principal component analysis
  on the similarity matrix and outputs this to a file named
  similarity.pca in the output directory. This analysis is to better
  allow sample clustering.

* species: Now calculates several species diversity metrics (species
  richness, Shannon diversity index, Pielou evenness index, and the
  inverse Simpson index) and outputs these to the summary.txt file.

* species: Computes upper and lower bounds for abundance estimates,
  along with a confidence value for each species. See the user manual
  for more information.

* species: Incorporates taxonomic information to allow clade-level
  abundance estimation. This requires the metagenomics reference
  database to contain taxonomic information. The easiest way is to
  obtain a metagenomics reference database from Real Time
  Genomics. See the user manual for more information.

* species: Removed the upper limit of 400 on the number of input SAM
  files.

* species: Removed --iterations flag; the termination criterion is now
  determined automatically.

* species: Added --min-confidence flag to specify the minimum
  confidence value for which a species/clade is reported in the
  output.

* species: Added --threads flag for setting number of threads. Species
  will now utilize multiple cores when possible, although for some
  input datasets there may be large portions of the run that only use
  a single core.

* species: Now automatically produces an HTML formatted report that
  allows interactive visualization of abundances (this feature is
  based on the Krona visualization tool).

* sdfstats: Added --taxonomy flag to allow outputting basic taxonomy
  information about a metagenomics reference SDF. This includes
  statistics such as the number of taxon nodes, the number of nodes
  with sequences, and the number of other nodes.

* Added three pipeline commands (composition-meta-pipeline,
  functional-meta-pipeline, composition-functional-meta-pipeline) to
  simplify the common use-cases of performing abundance and functional
  analysis starting directly from reads.  These commands internally
  call several RTG commands in sequence and output an HTML report
  containing summary information and links to primary output files.
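
For readers unfamiliar with the diversity metrics that species now writes to
summary.txt, the following is a minimal sketch of their textbook definitions.
It is not RTG's implementation: the function name `diversity_metrics`, the use
of natural logarithms, and the handling of the single-species case are all
assumptions made for illustration.

```python
import math


def diversity_metrics(abundances):
    """Return textbook diversity metrics for a list of non-negative abundances."""
    # Only species that are actually present contribute to the metrics.
    present = [a for a in abundances if a > 0]
    total = sum(present)
    if total == 0:
        raise ValueError("at least one species must have a non-zero abundance")
    proportions = [a / total for a in present]

    richness = len(proportions)
    shannon = -sum(p * math.log(p) for p in proportions)
    # Pielou evenness divides by ln(richness), so it is undefined for one species.
    pielou = shannon / math.log(richness) if richness > 1 else float("nan")
    inverse_simpson = 1.0 / sum(p * p for p in proportions)

    return {
        "richness": richness,
        "shannon": shannon,
        "pielou": pielou,
        "inverse_simpson": inverse_simpson,
    }
```

With four equally abundant species, for example, richness and the inverse
Simpson index are both 4 and the Pielou evenness is 1.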

### Basic mapping

* map/cgmap: These commands now automatically output an HTML report
  containing mapping summary statistics and graphs useful for QC.

* map/cgmap/mapf: Speed improvement (~5%) when mapping reads stored in
  an SDF due to more efficient SDF loading.

* mapping: Bugfixes to aligners to better handle some edge cases
  (heuristic aligners would occasionally prefer an alignment with a
  higher score).

* calibrate: Added a new flag --bed-regions to restrict the
  calibration calculations to the regions in the provided bed
  file. This option should be used to calibrate mappings of exome data
  in order for the calibration files to contain the correct depth of
  coverage information for supplying to variant calling. (Upcoming
  releases will allow the regions file to be supplied directly during
  mapping, avoiding the need for exome processing to have this extra
  step).

* coverage: Added a new flag --bed-regions to restrict the coverage
  calculations to the regions in the provided bed file. This allows
  coverage to be reported both per-exome region, and across all exome
  regions.

* coverage: Some of the contents of the summary file have moved to
  separate files. stats.tsv contains per-reference-sequence (or per
  region for exome analysis) coverage statistics. levels.tsv contains
  breakdowns of the proportion of the reference genome (or exome when
  appropriate) at the various coverage levels.

### Variant calling

* snp/family/population/somatic: The regions.bed file describing the
  types of processing occurring across the genome contains some new
  categories, and the values used in the name column of the bed file
  have been improved. See the user manual for more information.

* snp/family/population/somatic: The GQ values produced for each
  sample have been improved.

* snp: Removed a case where variant calls failing ambiguity
  filters could cause overlapping calls to be output.

* snpfilter: This tool can output to stdout when '-' is used as the
  output file name.

* mendelian: Added the ability to read VCF directly from stdin when
  '-' is used as the input file name.

* vcfstats: Added the ability to read VCF directly from stdin when '-'
  is used as the input file name.

* vcfstats: --known and --novel flags added for calculating stats for
  known only or novel only variants, as determined by whether the VCF
  record has an identifier set (i.e.: column 3 of the record).
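
The known/novel split used by vcfstats can be illustrated with a short sketch.
This is not the vcfstats code: the helper names `is_known` and
`count_known_novel` are hypothetical, and treating the VCF missing value `.`
as the only "unset" identifier is an assumption.

```python
def is_known(vcf_record_line):
    """Return True if the ID column (column 3) of a VCF record is set."""
    fields = vcf_record_line.rstrip("\n").split("\t")
    # Assumption: an unset identifier is written as the VCF missing value '.'.
    return fields[2] != "."


def count_known_novel(lines):
    """Count (known, novel) records, skipping header and blank lines."""
    known = 0
    novel = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        if is_known(line):
            known += 1
        else:
            novel += 1
    return known, novel
```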

### Other

* many: We have added the ability to log the usage of RTG commands to
  a server. Depending on your license, it may be a requirement to have
  this enabled. See the user manual for more information.

* many: For consistency, several output file extensions containing
  tab-delimited data have been changed from txt to tsv where
  appropriate. This better corresponds to the file contents and allows
  appropriate "click-to-open" behavior for viewing tabular results in
  a spreadsheet application. Full listing below:

  mapx: Renamed primary output files (alignments.txt -> alignments.tsv 
  and unmapped.txt -> unmapped.tsv)

  map/cgmap: Read group statistics output file (rgstats.txt -> rgstats.tsv)

  sv: Primary output files (sv_simple.txt -> sv_simple.tsv and
  sv_bayesian.txt -> sv_bayesian.tsv)

  assemble: The graph output directories now use .tsv files instead of
  .txt files for the Path.N and header output files, but can still
  read the existing graph output directories.

  snpsimeval: All ROC output files now use .tsv extension.

* many: When specifying a --region using name:start+length notation,
  an extra base was being included. This has been fixed.

* samrename: Added flags --no-gzip and --no-index for consistency with
  similar commands.

* samrename: Added flags --end-read and --start-read to reduce the
  memory requirements when renaming mapping outputs that had also been
  run on a subset of reads.

* readsim: When simulating a metagenomic sample, this beta tool now
  outputs a file containing the generated fractions for each input
  sequence.

* popsim: New beta tool to generate a VCF containing simulated
  population-level variants.

* samplesim: New beta tool to generate a simulated pedigree-free
  member of a population using variants defined in a VCF (such as that
  created by popsim).

* denovosim: New beta tool to simulate de novo variants within a
  genome.

* childsim: New beta tool to generate a simulated genotype as
  offspring of existing parent genotypes defined in a VCF.

* samplereplay: New beta tool to replay variants defined in a VCF into
  a reference genome, to be used with readsim.
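
The --region fix for name:start+length notation concerns a classic off-by-one
case. The sketch below shows one way such a region could be converted to an
inclusive range. It is illustrative only and is not RTG's parser: the function
name `parse_region` is hypothetical, and 1-based inclusive coordinates are
assumed.

```python
def parse_region(region):
    """Parse 'name:start+length' or 'name:start-end' into (name, start, end).

    Coordinates are assumed to be 1-based and inclusive at both ends.
    """
    name, _, span = region.rpartition(":")
    if not name:
        raise ValueError("expected name:start+length or name:start-end")
    if "+" in span:
        start_text, length_text = span.split("+", 1)
        start = int(start_text)
        # A region of `length` bases beginning at `start` ends at
        # start + length - 1; using start + length would add an extra base.
        end = start + int(length_text) - 1
    elif "-" in span:
        start_text, end_text = span.split("-", 1)
        start = int(start_text)
        end = int(end_text)
    else:
        raise ValueError("expected name:start+length or name:start-end")
    return name, start, end
```

Under these assumptions, `chr1:100+50` and `chr1:100-149` describe the same
50 bases.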


RTG Investigator 2.7.5 (2013-02-08)
-----------------------------------

This is a bugfix only release:

* snp/family/population: Fix an infinite loop that could occur near
  the start of variant calling involving multiple read groups.


RTG Investigator 2.7.4 (2013-01-11)
-----------------------------------

This is a bugfix only release:

* many: SAM read group platform name checks are now case-insensitive.

* map: When mapping input reads contained in SAM files, the input no
  longer needs to be sorted by read name.


RTG Investigator 2.7.3 (2012-11-23)
-----------------------------------

This is a bugfix only release:

* snp/family/population: Fixed an exception if the VCF supplied as
  population priors contained a variant at the first base of a
  chromosome.

* vcfmerge: This tool used to attempt to merge variants at the same
  reference position where the length of reference spanned differed
  between variants, by adding padding reference bases to the shorter
  variant. This is misleading, as it assumes the subsequent bases are
  non-variant. These variants are no longer merged.

* vcfmerge: Now outputs warnings when overlapping variants are
  encountered within a sample.

* snp/family/population: Fixed an exception that could occur when
  processing mappings with very long (>1Kb) indels/skips within an
  alignment.


RTG Investigator 2.7.2 (2012-10-23)
-----------------------------------

This is a bugfix only release:

* population: Fixed another corner case that could arise with the
  disagreeing hypotheses.


RTG Investigator 2.7.1 (2012-10-19)
-----------------------------------

This is a bugfix only release:

* population: The disagreeing hypothesis attribute (DH) was not
  formatted correctly, and could contain a ':' character which made
  the resulting VCF invalid. The first element of the DH attribute is
  now using GT-style notation (i.e. expressed using allele IDs rather
  than the allele bases themselves).


RTG Investigator 2.7 (2012-09-22)
---------------------------------

Changes in this release:

### Basic mapping

* cgmap: Removed the --read-names flag, as the Complete Genomics raw
  reads files do not contain read names.

* map/cgmap: Speed improvements when mapping very large read
  datasets.

* map/cgmap: The functionality of svprep has been integrated into
  mapping, to assist with more streamlined indel and structural
  variant discovery. This stage can be disabled by using the flag
  --no-svprep.

* map/cgmap/mapx: These mapping tools now produce unmapped reads
  output by default. The flags -U/--report-unmapped from previous
  versions have been replaced by a new flag --no-unmapped to disable
  output of unmapped reads.

* coverage: The reference SDF is now optional. If not supplied, non-n
  coverage statistics will not be computed.

### Metagenomics

* species: Added the ability to supply unmapped reads SAM files, in
  which case it will compute a new statistic, which is the percentage
  of the entire sample that each species represents. See the user
  manual for more information.

* species: The species module requires coordinate-sorted mappings but
  was not checking whether input mappings were in fact coordinate
  sorted. This is now fixed.

### Variant calling

* all callers: Fixed an exception that could occur on short reference sequences
  containing few calls.

* all callers: Reduced memory usage, particularly when using --all mode.

* all callers: Improved speed when calling on very high coverage
  single-end data.

* all callers: The output VCF now includes a SAMPLE header line that
  indicates the sex of the sample when known (this is useful for
  downstream tools such as rtg mendelian).

* all callers: Population priors that contained a large number of
  genomes (e.g. 1000 genomes VCF) would use much more memory than
  required. Such very large files will still be very slow to parse
  though, and may even dominate variant caller runtime. An upcoming
  release will address this.

* all callers: Population priors are now also used to inform complex
  calling.

* all callers: The behaviour of the --max-coverage and
  --max-coverage-multiplier flags has changed.  Previously these
  adjusted both built-in thresholds used to skip complex calling in
  areas of very high coverage, and coverage filters applied to
  variants at output time. These are now used only to control the
  maximum depth of coverage (across all samples) for which calling is
  made.

* all callers: Output filtering on depth of coverage can still be
  achieved by using the new --filter-depth and
  --filter-depth-multiplier flags.  The default is to not apply any
  coverage filtering, however if you are running datasets with read
  lengths much less than 100 it may be beneficial to set this to
  approximately 3 times the average coverage or apply other post-call
  filtering. See the user manual for more information.

* all callers: the --max-ambiguity flag has been renamed to
  --filter-ambiguity to clarify that it is a filter applied at the
  output level.

* all callers: Fix an exception when input mappings contain QUAL
  scores higher than 63.

* population: Improved error handling when parsing PED files.

* population: Hypothesis pruning is now turned on by default.
  Previously, calling could experience performance problems in areas of
  extremely high coverage due to too many candidate hypotheses.

* population: Now utilizes family calling code when relationships in
  the input pedigree file indicate families are present. If you wish
  to disable relationship information during calling, simply supply a
  pedigree file that contains missing values (0) in the paternal id /
  maternal id columns. In cases where a sample is a member of multiple
  families (e.g. parent in one family and child in another), but the
  calls from each family disagree, these calls are annotated with the
  DH attribute (these are good candidates for de novo mutations).

* population: the embedded family calling code also supports partial
  families where one or more members are referenced in the pedigree
  file but have not been sequenced or mapped. By default, calls will
  only be produced for samples for which input mappings have been
  provided. There is a new option --impute to output the imputed
  genotype for a family member that has not been sequenced/mapped.
  The accuracy of these calls will be better for parents in families
  with many children.

* vcfstats: New flag --sample to output variant statistics for only
  the specified sample (the default behaviour is to calculate
  statistics for every sample in the VCF file).

* vcfstats: New flag --allele-lengths to output a histogram of variant
  lengths, broken out by variant type.
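
The population notes above mention disabling relationship information by
supplying a pedigree file with missing values (0) in the paternal id /
maternal id columns. A minimal sketch of producing such a file follows. The
function name `strip_relationships` is hypothetical, and the conventional
six-column PED layout (family, individual, father, mother, sex, phenotype) is
assumed.

```python
def strip_relationships(ped_lines):
    """Return PED lines with the paternal and maternal IDs set to 0 (missing)."""
    result = []
    for line in ped_lines:
        stripped = line.strip()
        # Pass blank lines and comment lines through unchanged.
        if not stripped or stripped.startswith("#"):
            result.append(line.rstrip("\n"))
            continue
        fields = stripped.split()
        if len(fields) < 4:
            raise ValueError("PED line has fewer than 4 columns: " + stripped)
        # Columns 3 and 4 hold the paternal and maternal IDs.
        fields[2] = "0"
        fields[3] = "0"
        result.append("\t".join(fields))
    return result
```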

### Other

* all variant callers/coverage/cnv/sammerge/readsimeval: These modules
  contain a new flag, --min-mapq that allows filtering the input SAM
  to ignore all SAM records with MAPQ lower than the given value.

* all: Changed the wording of the final entry in progress files for
  failed runs, to make failures more obvious and easier to detect in
  scripts ("unsuccessful" -> "failed").

* mendelian: When outputting records containing Mendelian errors to a
  VCF file, the flag has been renamed from --output to
  --output-inconsistent. There is a new flag --output-consistent to
  output those records that do not contain mendelian errors.

* mendelian: The summary percentage statistics take the total records
  as only those for which some family members contain alternative
  alleles.

* mendelian: The --male and --female flags need not be supplied when
  checking a VCF file that contains sex information in the SAMPLE
  headers.

* mendelian: Family pedigree information can now alternatively be
  supplied via a PED pedigree file.

* snpfilter: New flag --remove-all-same-as-ref to remove variants
  where all samples are non-variant.

* bgzip: This command now accepts multiple file names for
  zipping/unzipping.

* snpsimeval: The flag -m/--mutations has been renamed to
  -b/--baseline.
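
The effect of the --min-mapq flag can be pictured as a simple filter over SAM
records. The sketch below is illustrative only, not the RTG implementation:
the function name `filter_by_mapq` is hypothetical, and it makes no special
case for MAPQ 255 (which the SAM specification uses for "not available").

```python
def filter_by_mapq(sam_lines, min_mapq):
    """Yield header lines and the records whose MAPQ is at least min_mapq."""
    for line in sam_lines:
        # Header lines start with '@' and are always kept.
        if line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        # MAPQ is the fifth mandatory SAM column.
        if int(fields[4]) >= min_mapq:
            yield line
```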


RTG Investigator 2.6 (2012-05-22)
---------------------------------

Major features of this release:

* Improvements to mapping speed. Many of the components used during
  mapping have been examined, improved, and multi-threaded, in
  particular calibration, tabix index creation, handling of mapping
  direct from FASTA/FASTQ/SAM/BAM. For our typical HiSeq mapping runs,
  elapsed time for a mapping run is approximately 30% faster.

* Reductions in memory use during mapping. These contribute to the
  speed improvements above but also mean that more reads can be mapped
  in a single run on a given compute node.

* Reductions in temporary disk footprint during mapping. Many
  temporary files encode their information more efficiently and are
  cleaned up as soon as they are no longer needed, giving an
  approximately 50% reduction in temporary disk usage.

* Significant speed and memory improvements to the coverage module, in
  some cases orders of magnitude faster and using less memory.

* New module for performing population aware variant calling on
  multiple samples simultaneously. This can provide a significant
  improvement in accuracy over the single-genome variant caller,
  particularly when the per-sample mappings are relatively low
  coverage (below 15x).

Changes by command:

* format: Better error checking and improved output when formatting
  FASTA data.

* sdfsubseq: The region is now supplied as an anonymous flag
  (i.e. the "--region" or "-s" are no longer needed), and it is possible
  to supply multiple regions. This makes it easy to extract a FASTA
  file containing multiple regions extracted from an SDF.

* many: Tabix indexing of SAM/BAM output files is significantly
  faster.

* map/cgmap/calibrate: Calibration has been sped up significantly.

* map/mapf: Reduced memory usage during mapping, particularly when
  mapping direct from FASTA/FASTQ or using --read-names.

* map/cgmap: Reduced disk requirements for intermediate files during
  mapping by implementing more efficient temporary files and deleting
  temporary files as soon as they are no longer needed.

* map/cgmap: Multithreaded the tabix indexing of output SAM/BAM files.

* map/cgmap/mapf: Fix exception when mapping very large read sets with
  quality data.

* map/cgmap/mapf: Fix exception when determining percentage based
  repeat frequency cutoff for very large read sets.

* map/samrename: Illumina paired-end reads with /1, /2 arm-indicator
  suffix have these suffixes stripped in the output SAM/BAM QNAME
  field, as per the SAM specification.

* sammerge: Multithreaded handling for the case of multiple input
  files for improved throughput. Use the new --threads flag to
  customize the behaviour.

* coverage: Significantly (1 to 2 orders of magnitude) faster and more
  memory-efficient implementation.

* coverage: New flag --keep-duplicates to disable the automatic
  detection and removal of optical/PCR duplicates during coverage
  calculation.

* coverage: The smoothing calculation now takes into account the
  reduced size of smoothing window at the edges of reference
  sequences.

* species: More robust parsing of the species relabel file.

* species: Now omits from the output any species that are determined
  to not occur in the sample. A comment line is included in the output
  that lists the number of omitted species.

* population: This is a new command that performs multi-sample
  population variant calling. See the user manual for more information
  and examples.

* snp/family/population/somatic: AB and AR genotype fields are now
  only output when the call has covering reads. There are rare cases
  when calls may be made with no covering reads (primarily when
  evidence for the call comes from other family or population members).

* snp/family/population/somatic: New flag --population-priors allows
  supplying an input VCF containing variants observed in the
  population, to be used as priors during calling. The input VCF must
  be tabix indexed.

* snp/family/population/somatic: Improved speed during complex calling
  (some runs are 25% faster).

* snp/family/population/somatic: Minor improvements to complex calling
  trigger conditions, particularly with higher coverage and in the
  presence of short repeats.

* somatic: Fixed an exception in areas of high coverage and high
  complexity (i.e. when many possible hypotheses are observed).

* vcfmerge: New flag --stats to output summary statistics
  corresponding to the contents of the output VCF.

* vcfmerge: The default behaviour is to refuse merging when fields are
  encountered with the same ID but differing descriptions, as the
  field semantics may be completely different. A new flag
  --force-merge allows such headers to be merged on a per-field basis.

* svprep: This command now updates the unmapped reads SAM file if
  present to give the expected location of unmapped arms when the
  other arm is uniquely mapped. This can be used to assist analysis of
  split-reads and structural variants.

* discord: The --max-ambiguity and --max-coverage flags have been
  removed, as they were of dubious value and impeded development of
  other functionality.

* discord: Calls with predicted breakpoints falling outside the
  reference sequences have those predicted locations adjusted to
  comply with the specifications for the relevant output format.
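
The coverage smoothing change above (accounting for the reduced window size at
the edges of reference sequences) can be sketched as a moving average that
divides by the number of positions actually inside the sequence. This is an
illustration under assumptions, not the RTG implementation: the function name
`smooth_coverage` and the symmetric `radius` window are hypothetical.

```python
def smooth_coverage(depths, radius):
    """Moving average over [i - radius, i + radius], clipped to the sequence."""
    n = len(depths)
    smoothed = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = depths[lo:hi]
        # Divide by the real window length, which is shorter near the edges,
        # rather than by the nominal size 2 * radius + 1.
        smoothed.append(sum(window) / len(window))
    return smoothed
```

With uniform depth 10 and a radius of 2, every position smooths to 10. Dividing
by the nominal window size of 5 would instead report 6 at the first and last
positions.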


RTG Investigator 2.5.2 (2012-05-08)
-----------------------------------

Changes in this release:

* many: Several data file format problems are now reported as a
  regular user message rather than causing a talkback.

* coverage: This command was flushing far too often, resulting in
  slower run times than 2.4. This has been fixed.

* cgmap: Better validation for input read data containing expected
  read lengths.

* mapx: The maximum word size of 12 is now checked during parameter
  validation.

* vcfmerge: Flag-type INFO values were not being passed through
  correctly.

* species: Fixed exception when input species contained 0-length
  genomes.

* family: Fixed an exception that could occur when calling on the Y
  chromosome.


RTG Investigator 2.5.1 (2012-03-14)
-----------------------------------

Changes in this release:

* map/mapf/cgmap: Fixed exception when mapping large (23Gb) input
  files directly from FASTA/FASTQ.

* map/mapf/cgmap: Doubled the limit on the number of reads that can be
  handled in one mapping run, assuming available memory, (to 2^31
  single end reads, 2^30 paired-end reads).

* snp/family/somatic: Fix exception when given input mappings with
  average coverage less than 1 on any reference sequence.

* many: Truncated bgzip files are now reported as a regular user
  message rather than causing a talkback.

* map/cgmap: Calibration is slow when you have large numbers of
  reference sequences. You may use the new --no-calibration flag to
  disable calibration if you do not require calibration data for
  subsequent variant calling.


RTG Investigator 2.5 (2012-03-06)
---------------------------------

Major features of this release:

* Multithreaded variant calling. While it was always possible to use
  the --region flag to execute variant calling as separate jobs,
  the variant callers (snp/somatic/family) are now internally
  multithreaded for improved throughput.

* Improvements to variant calling accuracy. Primarily through better
  complex calling, automatic duplicate read detection, and
  improvements to calling near simple repeats.

* Improvements to somatic caller accuracy. In particular, the somatic
  caller now explicitly models loss of heterozygosity events and
  includes loss of heterozygosity information in VCF records where
  appropriate.

* Improved Ion Torrent support. This includes support for paired-end
  Ion Torrent data, and more accurate variant calling.

* Mapping commands may now directly output BAM as an alternative to
  block-compressed SAM.

* SDFs now store the full description line of sequence names from
  FASTA/FASTQ input files.  Where appropriate, commands use the full
  name in output (e.g. in many cases you won't need to supply a
  relabel file when running the species command with long species
  names). The SDF version has been incremented as a result of this
  change. RTG supports SDF backward compatibility but not forward
  compatibility, so versions of RTG prior to 2.5 will not be able to
  read these newer SDFs.

* Beta level commands for structural variant detection. Feedback on
  these commands is particularly welcome.


Changes by command:

* many: Output SAM/BAM files now declare themselves as version 1.3 in the
  header.

* many: The TLEN/ISIZE field of SAM files is now calculated as the
  "observed template length" (as described in the SAM 1.3
  specification) rather than the "distance between 5' ends" (described
  in the SAM 1.2 spec).

* many: More robust parsing of the genome SDF reference.txt file that
  specifies sex / ploidy.

* many: Accept BAM indexes named <file>.bai, not just <file>.bam.bai.

* many: Performance improvements to tabix reading.

* many: The --no-tabix-index flag has been renamed to --no-index (and
  applies to both .tbi and .bai index production).

* wrapper: The rtg wrapper script was broken if there was a space in
  the path to your Java executable (this includes the case of using
  the bundled JRE when there was a space in the path to RTG
  installation directory).

* format: Support for formatting coordinate-sorted SAM/BAM read data.
  Note, however, that there may be speed issues when mapping such
  data, as often all unmated / unmapped reads will be processed at
  the same time and these are more intensive to map than reads that
  can be properly mated.

* map/mapf/cgmap/mapx/sdfsplit: Fixed an implementation limitation on
  the number of bases in a read set that could be processed in a
  single run. The old limitation was at approximately 45Gnt (per arm
  for paired-end data), roughly 1.3B CG reads or 450M HiSeq reads,
  which has been removed.  Mapping commands still have an
  implementation limit on the number of reads that can be processed in
  a single mapping run (2^30 single end reads, 2^29 paired-end reads)
  which will be addressed in an upcoming release.

* map/cgmap: Fixed an exception that occurred on very large datasets
  (~500M reads) when the proportion of unmated reads was high.

* map/mapf/cgmap: Added the ability to output BAM rather than tabixed
  SAM (use the new flag --bam to enable this).

* map/mapf/cgmap: Added the ability to only mate reads that map in a
  particular orientation, to improve the ability to identify
  structural variants. For example, typical Illumina reads should have
  --orientation FR, Complete Genomics reads should have --orientation
  TANDEM. The default (ANY) does not enforce any particular
  orientation on mated reads.

* map/mapf/cgmap: Mapping against a reference that contains duplicated
  sequence names could cause an exception. This has been fixed.

* sam2bam/sammerge: When merging/converting files that have
  accompanying calibration files, a merged calibration file will
  automatically be created.

* sammerge: Flag --exclude-duplicates was not working correctly. Fixed. 

* sammerge: Added the flag --legacy-cigars to allow converting new-style
  (X/=) CIGARs to legacy CIGARs in the output.

* sam2bam/sammerge/coverage/snp/family/somatic: These commands no
  longer have a limit on the number of input SAM files.

* index: The format "snp" is no longer listed as an option (use "vcf"
  instead). The format "coverage" has been removed (use either "bed"
  or "coveragetsv" instead).

* coverage: New option --bedgraph to cause output to be written in
  bedgraph format rather than bed.  This allows coverage files to be
  directly viewed in tools such as IGV.

* snp/family/somatic: Fully multithreaded variant calling.

* snp/family/somatic: Now implements duplicate read detection. Use
  --keep-duplicates to disable this behaviour.

* snp/family/somatic: Now automatically loads mapping calibration
  files corresponding to each input SAM file (they may still be
  explicitly supplied if desired, for example if you typically move or
  rename SAM or calibration files after mapping). Use the new flag
  --no-calibration if you wish to disable calibration.

* snp/family/somatic: Improvements to complex calling.

* snp/family/somatic: Improvements to calling near simple repeats
  (homopolymer/dinucleotide repeats).

* snp/family/somatic: Better error handling if specifying a --region
  with coordinates that exceed the bounds of the requested chromosome.

* snp/family/somatic: New flag --coverage-multiplier to adjust the
  thresholds used to detect over-coverage calls when calibration
  information is available. The default coverage multiplier is 2.0
  (i.e. variants where coverage is twice the average coverage over the
  entire sequence will be flagged as over-coverage).

* snp/family/somatic: Changed representation of variant calls that
  fail overcoverage filters: OC appears in the FILTER column, and CT in
  the INFO field specifies the actual threshold applied. See the user
  manual for more information.

* snp/family/somatic: Now outputs total depth (DP) field in VCF.

* snp/family/somatic: Utilizes RTG Complete Genomics extended cigar
  attributes (XQ/XR/XU) when present (this primarily affects
  non-complex calling, as complex calling performs its own
  probabilistic realignment).

* snp/family/somatic: Fix an exception when running variant calling
  against mappings containing read groups that have no platform
  specified.

* somatic: Now models loss of heterozygosity events and indicates in
  the VCF whether a variant represents a LOH event. See the user
  manual for more details.

* vcfstats: New command to output summary statistics for a VCF file.

* snpfilter: Changed the default behaviour so that if no flags are
  specified, no filtering is performed. Flags must be explicitly
  supplied corresponding to the desired filtering.

* species: Graceful termination in the case of running detection
  against a single species genome.

* species: Now outputs fractions as both a percentage of the mapped
  reads and as a percentage of the whole sample.  You should now
  supply the unmapped SAM files from mapping to allow the sample-level
  calculation to occur.

* species: The format of the relabel file has changed. See the user
  manual for more information.

* sdfsubset: Added --names to allow extracting sequences by name
  rather than sequence ID.
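
The TLEN/ISIZE change listed above is easiest to see with a small worked
example. The sketch below compares the two definitions for a pair mapped to
the same reference, using 1-based inclusive alignment coordinates. It is a
simplification rather than RTG's code: the function names are hypothetical,
signs are omitted, and counting the 5'-to-5' distance inclusively is an
assumption.

```python
def observed_template_length(start1, end1, start2, end2):
    """Unsigned length from the leftmost to the rightmost mapped base."""
    return max(end1, end2) - min(start1, start2) + 1


def five_prime_position(start, end, reverse):
    """The 5' end is the alignment end for reverse-strand reads, else the start."""
    return end if reverse else start


def five_prime_distance(start1, end1, reverse1, start2, end2, reverse2):
    """Unsigned, inclusive distance between the 5' ends of the two reads."""
    a = five_prime_position(start1, end1, reverse1)
    b = five_prime_position(start2, end2, reverse2)
    return abs(b - a) + 1
```

For a typical inward-facing (FR) pair at 100-199 (forward) and 300-399
(reverse) both definitions give 300. If the orientations are swapped so the
reads face outward, the observed template length is still 300 while the
5'-to-5' distance shrinks to 102.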


Beta commands:

* mendelian: This command detects variants that violate mendelian
  inheritance constraints. The records may be output directly to a VCF
  file.

* vcfmerge: This command combines separate input VCF files (either as a
  result of calling the same sample on separate chromosomes, or to
  combine individual sample VCF into a multi-sample VCF) into a single
  output VCF.

* svprep: Produces additional statistics in order to support the
  discordant read breakpoint tool. The argument specification has also
  changed slightly in that --input need not be supplied before the
  input mapping directory name.

* discord: New breakpoint finding tool based on clusters of discordant
  reads.


RTG Investigator 2.4.1 (2012-02-13)
-----------------------------------

This is a bug fix release only.

* cgmap: The --sex flag was not being correctly obeyed.

* sdf2fastq: Fix for incorrect sequence output from SDFs containing
  variable length reads.

* coverage: Fixed a case where 0 coverage could result in a NaN in
  the output file.


RTG Investigator 2.4 (2011-11-23)
---------------------------------

Major features of this release:

* mapx now has support for variable length reads and reads longer
  than 189nt. Bear in mind that as mapx currently performs global
  alignment, longer reads will be less likely to have a high scoring
  match - you may need to adjust alignment thresholds appropriately.

* The snp module for calling SNPs, MNPs, and indels now supports
  haploid calling, and is faster (almost 2x faster for Complete
  Genomics data).

* End to end handling of sex chromosomes in human variant
  calling. After creating a one-off chromosome specification file for
  your reference genome, mapping and variant calling commands allow
  you to specify the sex of each individual being processed.

* Improved SNP calling accuracy for Ion Torrent, largely as a result
  of better handling large indels during initial mapping and
  realignment during variant calling.

* New somatic variant caller (licensees only). As with the singleton
  variant caller, this module is also able to utilize the chromosome
  specification file to automatically produce appropriate
  haploid/diploid calling on sex chromosomes.

* New pedigree-aware family variant caller (licensees only). This
  caller performs joint calling of all members of a family (mother,
  father, and any number of sons/daughters). This particularly
  improves the accuracy of variant calling when coverage of each
  individual is low. As with the singleton caller, this module is also
  able to utilize the chromosome specification file to automatically
  produce appropriate haploid/diploid calls on sex chromosomes.

Changes by command:

* family/somatic: These modules now implement complex calling
  resulting in improved accuracy.

* family: Now produces a QUAL score.

* mapx: Support for variable length read sets - previously read sets
  with more than a few nt deviation in length were not supported (if
  attempted, mapping performance would degrade with shorter
  reads). Variable length reads are now fully supported.

* mapx: Initial support for reads longer than 189nt.

* mapx: Handling of the --max-alignment-score for percentage based
  thresholds was incorrect in that it was calculated based on the
  pre-translated read length. This is now fixed and the flag
  description has been updated.

* snp: Improvements to Ion Torrent snp calling (as determined by the
  read group platform field being set to IONTORRENT).

* snp: Added new flag --ploidy to allow specifying whether to perform
  haploid or diploid variant calling.

* snp: Switched to a new internal architecture to more readily allow
  multithreading. There is no longer a limit on the number of input
  SAM files, however the input files must now be tabix-indexed SAM (or
  indexed BAM).

* snp: Fixed the handling of calling near boundaries of user-specified
  --region locations (previously mappings overlapping the region
  border were not being supplied to the snp caller).

* snp: CG snp calling speed is approximately 2x faster.

* map/snp/somatic/family: Added support for sex specific mapping and
  variant calling by defining a reference configuration file and using
  the appropriate --sex flag during mapping and snp calling. See the
  user manual for more details.

* sdfstats: New option --sex to list the reference sequences along
  with their ploidy for each sex.

* map/snp/somatic/family: Improvements have been made to the
  calibration files produced during mapping to allow snp calling
  coverage filters to handle coverage variations per sequence (e.g.
  due to varying ploidy on sex chromosomes). You can generate new
  calibration files for existing mappings with the calibrate module.

* snp: New VCF filter RCEQUIV denotes when a variant is equivalent to
  a previous variant (these typically occur at either end of
  homopolymer regions).

* snp: New output file regions.bed.gz containing extra information
  regarding the calling. Currently it lists the regions that were
  called using complex calling.

* snp: QUAL scores for extremely confident calls were being capped at
  1000000, however the capping was also affecting all scores above
  about 3000. QUAL scores are now more accurately output in the VCF.

* The .bz2 decompression library could not handle multi-member
  files. This has been extended to support these files.

* extract: Bug fix when extracting VCF/coverage from a file containing
  a single reference and no region was specified.

* extract: Bug fix for when the specified region contained an invalid
  range.

* all: Updated the bundled JVM to 1.6.0_29.

* windows: Fixed a problem when RTG was installed to a location
  containing spaces in the path name.
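
As background on the multi-member .bz2 fix listed above: a multi-member
file is several complete bzip2 streams concatenated together, for
example the result of appending one compressed file to another. The
following Python sketch is not RTG code (the helper name
`first_member_only` is hypothetical); it only illustrates why a decoder
that stops at the first end-of-stream marker silently drops data.

```python
import bz2


def first_member_only(data: bytes) -> bytes:
    """Decode only the first bzip2 stream, as a single-stream decoder would."""
    decoder = bz2.BZ2Decompressor()
    # Bytes after the first end-of-stream land in decoder.unused_data.
    return decoder.decompress(data)


# Two independently compressed streams, concatenated into one file image.
multi_member = bz2.compress(b">seq1\nACGT\n") + bz2.compress(b">seq2\nTTGA\n")

truncated = first_member_only(multi_member)  # only the first member
complete = bz2.decompress(multi_member)      # walks every member in turn
```

Here `truncated` holds only the first sequence, while `complete` holds
both.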


RTG Investigator 2.3.2 (2011-10-06)
-----------------------------------

* format: Added the ability to perform base-quality read trimming,
  using BWA-style "best quality sum" length determination. Trimming
  low quality ends off reads can significantly improve the quality of
  Ion Torrent mappings. E.g. --trim-threshold 15.

* map/mapf: Improved mapping defaults for Ion Torrent data.

* map/mapf/mapx/format: Added the ability to accept input read data in
  SAM/BAM format, by supplying --format sam-se or --format sam-pe, for
  single or paired-end data respectively. The input SAM/BAM file must
  be sorted by query name.

* mapf: reduced memory usage, particularly with large numbers of
  reference sequences.

* mapx: add a warning when the selected parameters will result in a
  large number of indexes, and are thus likely to give poor speed.

* coverage: fix an exception when encountering third-party SAM records
  with IH attribute set to 0 and NH greater than 0.

* sam2bam: this is a new module that specifically converts
  coordinate-sorted SAM to BAM.

* sammerge: updated the default behaviour to not perform filtering of
  records marked as unmapped or PCR duplicates (the flag
  --include-unmapped has been replaced by --exclude-unmapped, and the
  flag --include-duplicates has been replaced by --exclude-duplicates).

* sammerge: when the output file ends in ".bam", sammerge will produce
  BAM rather than SAM.
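
The "best quality sum" trimming mentioned for the format command can be
sketched as follows. This is a minimal Python rendering of the approach
used by BWA's quality trimming, which trims the 3' end only; it is an
illustration under that assumption, not RTG's implementation. The
function name and the `min_len` parameter are hypothetical, and
qualities are assumed to be already decoded to integer Phred values.

```python
def best_quality_sum_trim(quals, threshold, min_len=0):
    """Return the read length to keep after BWA-style 3' quality trimming.

    Scanning from the 3' end, accumulate (threshold - quality) and keep the
    cut point where that running sum is largest; stop once it goes negative.
    """
    if threshold < 1:
        return len(quals)  # trimming disabled
    running = 0
    best = 0
    keep = len(quals)
    for pos in range(len(quals) - 1, min_len - 1, -1):
        running += threshold - quals[pos]
        if running < 0:
            break  # enough good bases seen, no better cut further in
        if running > best:
            best = running
            keep = pos
    return keep


# Four good bases followed by a low-quality tail, threshold 15.
kept = best_quality_sum_trim([30, 30, 30, 30, 10, 5, 2], 15)
```

With these inputs `kept` is 4, so the three low-quality trailing bases
are removed and the four good bases are retained.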


NOTE: When performing snp calling with --region on a partial
chromosome, you should currently enlarge your region by a read length
on each end to ensure all supporting evidence is seen near the
boundaries. This will be addressed in a subsequent release.
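
One way to apply the workaround in the note above is to compute the
enlarged region before invoking snp. The Python helper below is a
hypothetical sketch: the function name is invented, and the
`name:start-end` string with 1-based inclusive coordinates is an
assumption about the region syntax, so check the user manual for the
exact form accepted by --region.

```python
def pad_region(name, start, end, read_length, seq_length=None):
    """Enlarge a 1-based inclusive region by one read length on each end."""
    padded_start = max(1, start - read_length)  # do not run off the start
    padded_end = end + read_length
    if seq_length is not None:
        padded_end = min(seq_length, padded_end)  # clamp to sequence end
    return f"{name}:{padded_start}-{padded_end}"


region = pad_region("chr1", 1000, 2000, read_length=100)
```

Calls falling inside the padding can then be discarded afterwards if
only the original interval is of interest.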


RTG Investigator 2.3.1 (2011-09-12)
-----------------------------------

* map/cgmap: SAM flag 0x100 (alignment is secondary) is now set for
  all non-uniquely mapped/mated records.

* map/cgmap: SAM flag 0x8 (mate is unmapped) in unmated and unmapped
  SAM files now indicates whether the mate is globally unmapped
  (however, mate position information is not available in these
  records). Previously this flag was always unset in order to avoid
  Picard warnings about not having position information supplied,
  however the SAM spec allows mate position to be unspecified and the
  information in the flag is useful nonetheless. These warnings will
  now be seen if you run the Picard validation tools.

* map: fix exception when using --top-random option.

* all: allow '=' in sequence names as long as it is not the first
  character.
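
To check the flag changes described above in your own SAM output, the
FLAG column can be decoded bit by bit. The bit values below are those
defined by the SAM specification; the short names and the `decode_flag`
helper are illustrative choices, not part of RTG.

```python
# Bit values from the SAM specification; the names are informal labels.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "last_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# A paired, first-in-pair, secondary record whose mate is unmapped.
labels = decode_flag(0x1 | 0x8 | 0x40 | 0x100)
```

For this example `labels` is
`["paired", "mate_unmapped", "first_in_pair", "secondary"]`.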


RTG Investigator 2.3 (2011-08-31)
---------------------------------

* cgmap: switch to a new aligner implementation that produces better
  alignments and results in a 20-30% improvement in execution
  time. The SAM extended attributes GC/GS/GQ containing CG specific
  information have been replaced by more expressive attributes
  XU/XR/XQ. See the user manual for more details.

* map/snp: Initial Ion Torrent support. Specifying the IONTORRENT
  platform in the read group information during mapping will alter
  default alignment penalties and thresholds to better handle the Ion
  Torrent indels and will propagate through variant calling.

* snp: ambiguity ratio (AR) and allele balance (AB) have been added to
  FORMAT output in VCF. Calls that are made using the complex
  realigning caller are now indicated as such with an XRX annotation.

* snp: summary statistics have been updated to contain more useful
  information in a more readable presentation.

* snp: removed --output-second flag which was a hangover from a
  previous output format and did not affect the VCF produced.

* many commands: now support reading .bz2 compressed FASTA/FASTQ
  files.

* mapx: now supports direct loading of reads from FASTA/FASTQ.

* coverage/species: now includes sequence lengths in output.

* coverage: produces additional coverage information regarding non-N
  regions.

* map/mapf: performance and memory improvements when mapping against
  very large numbers of reference sequences.

* map/cgmap/mapf: the value supplied to the --sam-rg flag may now be
  either the name of a file containing the read group information, or
  a string containing the read group information itself (tabs must be
  represented by the sequence \t rather than literal tab characters,
  see the documentation for more information).

* sdfsplit: now uses a disk-based SDF reader by default. The
  --in-memory flag has been added to enable the older method (for
  faster processing if sufficient RAM is available).

* format: added the --allow-duplicate-names flag to disable the
  duplicate sequence name detection (this can save large amounts of
  memory when formatting extremely large datasets).

* sdfsplit: renamed the --disable-dupe-detection flag to
  --allow-duplicate-names for consistency with format.

* rtg wrapper script: rtg and the java that gets invoked now share the
  same unix process group so that signal handling works as expected
  within cluster scenarios.
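
The string form of --sam-rg described above writes tabs as the
two-character sequence `\t`. The Python sketch below shows how such a
string maps onto SAM read group fields. It is not RTG's parser: the
`parse_read_group` helper, the requirement that the string begin with
`@RG`, and the example values are all assumptions made for
illustration.

```python
def parse_read_group(text):
    """Parse an @RG line written with literal '\\t' escapes into a dict."""
    line = text.replace("\\t", "\t")  # turn backslash-t into real tabs
    fields = line.split("\t")
    if fields[0] != "@RG":
        raise ValueError("read group must start with @RG")
    tags = {}
    for field in fields[1:]:
        key, sep, value = field.partition(":")
        if not sep or len(key) != 2:
            raise ValueError(f"malformed field: {field!r}")
        tags[key] = value
    if "ID" not in tags:
        raise ValueError("read group requires an ID field")
    return tags


# Raw string so that \t stays as two characters, as typed on a command line.
rg = parse_read_group(r"@RG\tID:rg1\tSM:sample1\tPL:IONTORRENT")
```

Here `rg` is `{"ID": "rg1", "SM": "sample1", "PL": "IONTORRENT"}`.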


RTG Investigator 2.2.1 (2011-07-14)
-----------------------------------

* mapx: fixed an overflow problem when the number of reads times the
  --max-top-results setting exceeded Integer.MAX_VALUE (2^31-1).

* rtg wrapper script: added safety checks for malformed cfg files (for
  example, it is easy to forget to include quotes when a property
  needs spaces). Also, the default rtg.cfg sets RTG_JAVA_OPTS to
  disable the JVM use of the popcount instruction until Oracle bug
  number 7063674 is fixed.

* many commands: included a workaround for a bug in gzip decompression
  that is present in many recent versions of the JRE. This allows us
  to include a no-JRE distributable, so we can now officially support
  MacOSX as a platform.

* EULA: permit investigators to use for evaluation; registration
  overview; non-competitive use only.

* snp/coverage: when supplying lists of SAM files via
  --input-list-file, the list files are now tolerant of extra
  white space surrounding the file names and empty lines. Lines starting
  with the hash character '#' are now treated as comments and are
  ignored.

* map/cgmap: RTG mated SAM files contain records in pairs, but in very
  heavy repeat regions this would occasionally be violated and the
  resulting SAM file would contain a SAM record for one arm but not
  the other. This is now fixed.

* map/cgmap/mapf: Fixed rare crash that could occur when running
  map/cgmap with --all-hits option, or mapf.
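
The tolerant handling of --input-list-file contents described above can
be pictured with the following Python sketch. It is an illustration
rather than RTG's code: the `read_file_list` name is hypothetical, and
treating a `#` preceded by whitespace as a comment is an assumption that
the release note does not confirm.

```python
def read_file_list(lines):
    """Return file names from list-file lines, skipping blanks and comments."""
    names = []
    for raw in lines:
        entry = raw.strip()  # tolerate surrounding white space
        if not entry or entry.startswith("#"):
            continue  # empty line or comment
        names.append(entry)
    return names


example = [
    "# mappings for sample1\n",
    "  run1/alignments.sam.gz  \n",
    "\n",
    "run2/alignments.sam.gz\n",
]
files = read_file_list(example)
```

For the example input, `files` contains the two file names with the
surrounding white space removed.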


RTG Investigator 2.2 (2011-06-08)
---------------------------------

Initial public release.

NOTE: Non-deterministic mapping results have been observed on modern
      CPUs with Java versions 1.6.0_18 and newer due to a bug in the
      use of the popcount instruction. If your CPU implements SSE4
      instructions, we recommend adding -XX:-UsePopCountInstruction to
      the RTG_JAVA_OPTS configuration setting to work around this. We
      have filed a bug with Oracle regarding this
      (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7063674) but
      there is currently no resolution.