Appendix¶
RTG reference file format¶
Many RTG commands can make use of additional information about the
structure of a reference genome, such as expected ploidy, sex
chromosomes, location of PAR regions, etc. When appropriate, this
information may be stored inside a reference genome’s SDF directory in a
file called reference.txt
.
The format
command will automatically identify several common
reference genomes during formatting and will create a reference.txt
in the resulting SDF. However, for non-human reference genomes, or less
common human reference genomes, a pre-built reference configuration file
may not be available, and will need to be manually provided in order to
make use of RTG sex-aware pipeline features.
Several example reference.txt
files for different human reference
versions are included as part of the RTG distribution in the scripts
subdirectory, so for common reference versions it will suffice to copy
the appropriate example file into the formatted reference SDF with the
name reference.txt
, or use one of these example files as the basis
for your specific reference genome.
To see how a reference text file will be interpreted by the chromosomes
in an SDF for a given sex you can use the sdfstats
command with the
--sex
flag. For example:
$ rtg sdfstats --sex male /data/human/ref/hg19
Location : /data/human/ref/hg19
Parameters : format -o /data/human/ref/hg19 -I chromosomes.txt
SDF Version : 11
Type : DNA
Source : UNKNOWN
Paired arm : UNKNOWN
SDF-ID : b6318de1-8107-4b11-bdd9-fb8b6b34c5d0
Number of sequences : 25
Maximum length : 249250621
Minimum length : 16571
Sequence names : yes
N : 234350281
A : 844868045
C : 585017944
G : 585360436
T : 846097277
Total residues : 3095693983
Residue qualities : no
Sequences for sex=MALE:
chrM POLYPLOID circular 16571
chr1 DIPLOID linear 249250621
chr2 DIPLOID linear 243199373
chr3 DIPLOID linear 198022430
chr4 DIPLOID linear 191154276
chr5 DIPLOID linear 180915260
chr6 DIPLOID linear 171115067
chr7 DIPLOID linear 159138663
chr8 DIPLOID linear 146364022
chr9 DIPLOID linear 141213431
chr10 DIPLOID linear 135534747
chr11 DIPLOID linear 135006516
chr12 DIPLOID linear 133851895
chr13 DIPLOID linear 115169878
chr14 DIPLOID linear 107349540
chr15 DIPLOID linear 102531392
chr16 DIPLOID linear 90354753
chr17 DIPLOID linear 81195210
chr18 DIPLOID linear 78077248
chr19 DIPLOID linear 59128983
chr20 DIPLOID linear 63025520
chr21 DIPLOID linear 48129895
chr22 DIPLOID linear 51304566
chrX HAPLOID linear 155270560 ~=chrY
chrX:60001-2699520 chrY:10001-2649520
chrX:154931044-155260560 chrY:59034050-59363566
chrY HAPLOID linear 59373566 ~=chrX
chrX:60001-2699520 chrY:10001-2649520
chrX:154931044-155260560 chrY:59034050-59363566
The reference file is primarily intended for XY sex determination but should be able to handle ZW and X0 sex determination also.
The following describes the reference file text format in more detail.
The file contains lines with TAB separated fields describing the
properties of the chromosomes. Comments within the reference.txt
file
are preceded by the character #
. The first line of the file that is
not a comment or blank must be the version line.
version1
The remaining lines have the following common structure:
<sex> <line-type> <line-setting>...
The sex field is one of male
, female
or either
. The
line-type field is one of def
for default sequence settings, seq
for specific chromosomal sequence settings and dup
for defining
pseudo-autosomal regions. The line-setting fields are a variable
number of fields based on the line type given.
The default sequence settings line can only be specified with either
for the sex field, can only be specified once and must be specified if
there are not individual chromosome settings for all chromosomes and
other contigs. It is specified with the following structure:
either def <ploidy> <shape>
The ploidy field is one of diploid
, haploid
, polyploid
or
none
. The shape field is one of circular
or linear
.
The specific chromosome settings lines are similar to the default
chromosome settings lines. All the sex field options can be used,
however for any one chromosome you can only specify a single line for
either
or two lines for male
and female
. They are specified
with the following structure:
<sex> seq <chromosome-name> <ploidy> <shape> [allosome]
The ploidy and shape fields are the same as for the default chromosome settings line. The chromosome-name field is the name of the chromosome to which the line applies. The allosome field is optional and is used to specify the allosome pair of a haploid chromosome.
The pseudo-autosomal region settings line can be set with any of the sex field options and any number of the lines can be defined as necessary. It has the following format:
<sex> dup <region> <region>
The regions must be taken from two haploid chromosomes for a given sex,
have the same length and not go past the end of the chromosome. The
regions are given in the format <chromosome-name>:<start>-<end>
where
start and end are positions counting from one and the end is
non-inclusive.
An example for the HG19 human reference:
# Reference specification for hg19, see
# http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=184117983&chromInfoPage=
version 1
# Unless otherwise specified, assume diploid linear. Well-formed
# chromosomes should be explicitly listed separately so this
# applies primarily to unplaced contigs and decoy sequences
either def diploid linear
# List the autosomal chromosomes explicitly. These are used to help
# determine "normal" coverage levels during mapping and variant calling
either seq chr1 diploid linear
either seq chr2 diploid linear
either seq chr3 diploid linear
either seq chr4 diploid linear
either seq chr5 diploid linear
either seq chr6 diploid linear
either seq chr7 diploid linear
either seq chr8 diploid linear
either seq chr9 diploid linear
either seq chr10 diploid linear
either seq chr11 diploid linear
either seq chr12 diploid linear
either seq chr13 diploid linear
either seq chr14 diploid linear
either seq chr15 diploid linear
either seq chr16 diploid linear
either seq chr17 diploid linear
either seq chr18 diploid linear
either seq chr19 diploid linear
either seq chr20 diploid linear
either seq chr21 diploid linear
either seq chr22 diploid linear
# Define how the male and female get the X and Y chromosomes
male seq chrX haploid linear chrY
male seq chrY haploid linear chrX
female seq chrX diploid linear
female seq chrY none linear
#PAR1 pseudoautosomal region
male dup chrX:60001-2699520 chrY:10001-2649520
#PAR2 pseudoautosomal region
male dup chrX:154931044-155260560 chrY:59034050-59363566
# And the mitochondria
either seq chrM polyploid circular
As of the current version of the RTG software the following are the
effects of various settings in the reference.txt
file when processing
a sample with the matching sex.
A ploidy setting of none
will prevent reads from mapping to that
chromosome and any variant calling from being done in that chromosome.
A ploidy setting of diploid
, haploid
or polyploid
does not
currently affect the output of mapping.
A ploidy setting of diploid
will treat the chromosome as having two
distinct copies during variant calling, meaning that both homozygous and
heterozygous diploid genotypes may be called for the chromosome.
A ploidy setting of haploid
will treat the chromosome as having one
copy during variant calling, meaning that only haploid genotypes will be
called for the chromosome.
A ploidy setting of polyploid
will treat the chromosome as having one
copy during variant calling, meaning that only haploid genotypes will be
called for the chromosome. For variant calling with a pedigree, maternal
inheritance is assumed for polyploid sequences.
The shape of the chromosome does not currently affect the output of mapping or variant calling.
The allosome pairs do not currently affect the output of mapping or variant calling (but are used by simulated data generation commands).
The pseudo-autosomal regions will cause the second half of the region
pair to be skipped during mapping. During variant calling the first half
of the region pair will be called as diploid and the second half will
not have calls made for it. For the example reference.txt
provided
earlier this means that when mapping a male the X chromosome sections of
the pseudo-autosomal regions will be mapped to exclusively and for
variant calling the X chromosome sections will be called as diploid
while the Y chromosome sections will be skipped. There may be some edge
effects up to a read length either side of a pseudo-autosomal region
boundary.
Pedigree PED input file format¶
The PED file format is a white space (tab or space) delimited ASCII
file. Lines starting with #
are ignored. It has exactly six required
columns in the following order.
Column | Definition |
---|---|
Family ID | Alphanumeric ID of a family group. This field is ignored by RTG commands. |
Individual ID | Alphanumeric ID of an individual. This corresponds to the Sample ID specified in the read group of the individual (SM field). |
Paternal ID | Alphanumeric ID of the paternal parent for the individual. This corresponds to the Sample ID specified in the read group of the paternal parent (SM field). |
Maternal ID | Alphanumeric ID of the maternal parent for the individual. This corresponds to the Sample ID specified in the read group of the maternal parent (SM field). |
Sex | The sex of the individual specified as using 1 for male, 2 for female and any other number as unknown. |
Phenotype | The phenotype of the individual specified using -9 or 0 for unknown, 1 for unaffected and 2 for affected. |
Note
The PED format is based on the PED format defined by the PLINK project: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
The value ‘0’ can be used as a missing value for Family ID, Paternal ID and Maternal ID.
The following is an example of what a PED file may look like.
# PED format pedigree
# fam-id ind-id pat-id mat-id sex phen
FAM01 NA19238 0 0 2 0
FAM01 NA19239 0 0 1 0
FAM01 NA19240 NA19239 NA19238 2 0
0 NA12878 0 0 2 0
When specifying a pedigree for the lineage
command, use either the
pat-id or mat-id as appropriate to the gender of the sample cell
lineage. The following is an example of what a cell lineage PED file may
look like.
# PED format pedigree
# fam-id ind-id pat-id mat-id sex phen
LIN BASE 0 0 2 0
LIN GENA 0 BASE 2 0
LIN GENB 0 BASE 2 0
LIN GENA-A 0 GENA 2 0
RTG includes commands such as pedfilter
and pedstats
for simple
viewing, filtering and conversion of pedigree files.
RTG commands using indexed input files¶
Several RTG commands require coordinate indexed input files to operate
and several more require them when the --region
or --bed-regions
parameter is used. The index files used are standard tabix or BAM index
files.
The RTG commands which produce the inputs used by these commands will by
default produce them with appropriate index files. To produce indexes
for files from third party sources or RTG command output where the
--no-index
or --no-gzip
parameters were set, use the RTG
bgzip
and index
commands.
RTG JavaScript filtering API¶
The vcffilter
command permits filtering VCF records via user-supplied
JavaScript expressions or scripts containing JavaScript functions that
operate on VCF records. The JavaScript environment has an API provided
that enables convenient access to components of a VCF record in order to
satisfy common use cases.
VCF record field access¶
This section describes the supported methods to access components of an individual VCF record. In the following descriptions, assume the input VCF contains the following excerpt (the full header has been omitted):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12877 NA12878
1 11259340 . G C,T . PASS DP=795;DPR=0.581;ABC=4.5 GT:DP 1/2:65 1/0:15
CHROM, POS, ID, REF, QUAL¶
Within the context of a --keep-expr
or record
function these
variables will provide access to the String representation of the VCF
column of the same name.
CHROM; // "1"
POS; // "11259340"
REF; // "G"
ALT, FILTER¶
Will retrieve an array of the values in the column.
ALT; // ["C", "T"]
FILTER; // ["PASS"]
INFO.{INFO_FIELD}¶
The values in the INFO
field are accessible through properties on the
INFO
object indexed by INFO ID
. These properties will be the string
representation of info values with multiple values delimited with “,
”.
Missing fields will be represented by “.
”. Assigning to these
properties will update the VCF record. This will be undefined for fields not
declared in the header.
INFO.DP; // "795"
INFO.ABC; // "4,5"
INFO.DPR = "0.01"; // Will change the value of the DPR info field
{SAMPLE_NAME}.{FORMAT_FIELD}¶
The JavaScript String prototype has been extended to allow access to the
format fields for each sample. The string representation of values in
the sample column are accessible as properties on the string matching
the sample name named after the FORMAT
field ID
These properties can
be assigned in order to make modifications. This will be undefined for fields
not declared in the header.
'NA12877'.GT; // "1/2"
'NA12878'.GT; // "1/0"
'NA12877'.DP = "10"; // Will change the DP field of the NA12877 sample
VCF header modification¶
Functions are provided that allow the addition of new INFO
or FORMAT
fields to the header and records. It is recommended that the following
functions only be used within the run-once portion of --javascript
.
They may be called on every record, but this will be slow.
ensureFormatHeader(FORMAT_HEADER_STRING)¶
Add a new FORMAT
field to the VCF if it is not already present. This
will add a FORMAT
declaration line to the header and define the
corresponding accessor methods for use in record processing.
ensureFormatHeader('##FORMAT=<ID=GL,Number=G,Type=Float,' +
'Description="Log_10 scaled genotype likelihoods.">');
ensureInfoHeader(INFO_HEADER_STRING)¶
Add a new INFO
field to the VCF if it is not already present. This
will add an INFO
declaration line to the header and define the
corresponding accessor methods for use in record processing.
ensureInfoHeader('##INFO=<ID=CT,Number=1,Type=Integer,' +
'Description="Coverage threshold that was applied">');
Distribution Contents¶
The contents of the RTG distribution zip file should include:
- The RTG executable JAR file.
- RTG executable wrapper script.
- Example scripts and files.
- This operations manual.
- A release notes file and a readme file.
Some distributions also include an appropriate java runtime environment (JRE) for your operating system.
README.txt¶
For reference purposes, a copy of the distribution README.txt
file
follows:
=== RTG.VERSION ===
RTG software from Real Time Genomics includes tools for the processing
and analysis of plant, animal and human sequence data from high
throughput sequencing systems. Product usage and administration is
described in the accompanying RTG Operations Manual.
Quick Start Instructions
========================
RTG software is delivered as a command-line Java application accessed
via a wrapper script that allows a user to customize initial memory
allocation and other configuration options. It is recommended that
these wrapper scripts be used rather than directly accessing the Java
JAR.
For individual use, follow these quick start instructions.
No-JRE:
The no-JRE distribution does not include a Java Runtime Environment
and instead uses the system-installed Java. Ensure that at the
command line you can enter "java -version" and that this command
reports a java version of 1.7 or higher before proceeding with the
steps below. This may require setting your PATH environment variable
to include the location of an appropriate version of java.
Linux/MacOS X:
Unzip the RTG distribution to the desired location.
If your RTG distribution requires a license file (rtg-license.txt),
copy the license file from Real Time Genomics into the RTG
distribution directory.
In a terminal, cd to the installation directory and test for success
by entering "./rtg version"
On MacOS X, depending on your operating system version and
configuration regarding unsigned applications, you may encounter the
error message:
-bash: rtg: /usr/bin/env: bad interpreter: Operation not permitted
If this occurs, you must clear the OS X quarantine attribute with
the command:
xattr -d com.apple.quarantine rtg
The first time rtg is executed you will be prompted with some
questions to customize your installation. Follow the prompts.
Enter "./rtg help" for a list of rtg commands. Help for any individual
command is available using the --help flag, e.g.: "./rtg format --help"
By default, RTG software scripts establish a memory space of 90% of
the available RAM - this is automatically calculated. One may
override this limit in the rtg.cfg settings file or on a per-run
basis by supplying RTG_MEM as an environment variable or as the
first program argument, e.g.: "./rtg RTG_MEM=48g map"
[OPTIONAL] If you will be running rtg on multiple machines and would
like to customize settings on a per-machine basis, copy
rtg.cfg to /etc/rtg.cfg, editing per-machine settings
appropriately (requires root privileges). An alternative that does
not require root privileges is to copy rtg.example.cfg to
rtg.HOSTNAME.cfg, editing per-machine settings appropriately, where
HOSTNAME is the short host name output by the command "hostname -s"
Windows:
Unzip the RTG distribution to the desired location.
If your RTG distribution requires a license file (rtg-license.txt),
copy the license file from Real Time Genomics into the RTG
distribution directory.
Test for success by entering "rtg version" at the command line. The
first time rtg is executed you will be prompted with some
questions to customize your installation. Follow the prompts.
Enter "rtg help" for a list of rtg commands. Help for any individual
command is available using the --help flag, e.g.: "rtg format --help"
By default, RTG software scripts establish a memory space of 90% of
the available RAM - this is automatically calculated. One may
override this limit by setting the RTG_MEM variable in the rtg.bat
script or as an environment variable.
The scripts subdirectory contains demos, helper scripts, and example
configuration files, and comprehensive documentation is contained in
the RTG Operations Manual.
Using the above quick start installation steps, an individual can
execute RTG software in a remote computing environment without the
need to establish root privileges. Include the necessary data files
in directories within the workspace and upload the entire workspace to
the remote system (either stand-alone or cluster).
For data center deployment and instructions for editing scripts,
please consult the Administration chapter of the RTG Operations Manual.
A discussion group is now available for general questions, tips, and other
discussions. It may be viewed or joined at:
https://groups.google.com/a/realtimegenomics.com/forum/#!forum/rtg-users
To be informed of new software releases, subscribe to the low-traffic
rtg-announce group at:
https://groups.google.com/a/realtimegenomics.com/forum/#!forum/rtg-announce
Citing RTG
==========
John G. Cleary, Ross Braithwaite, Kurt Gaastra, Brian S. Hilbush,
Stuart Inglis, Sean A. Irvine, Alan Jackson, Richard Littin, Sahar
Nohzadeh-Malakshah, Mehul Rathod, David Ware, Len Trigg, and Francisco
M. De La Vega. "Joint Variant and De Novo Mutation Identification on
Pedigrees from High-Throughput Sequencing Data." Journal of
Computational Biology. June 2014, 21(6):
405-419. doi:10.1089/cmb.2014.0029.
Terms of Use
============
This proprietary software program is the property of Real Time
Genomics. All use of this software program is subject to the
terms of an applicable end user license agreement.
Patents
=======
US: 7,640,256, 13/129,329, 13/681,046, 13/681,215, 13/848,653,
13/925,704, 14/015,295, 13/971,654, 13/971,630, 14/564,810
UK: 1222923.3, 1222921.7, 1304502.6, 1311209.9, 1314888.7, 1314908.3
New Zealand: 626777, 626783, 615491, 614897, 614560
Australia: 2005255348, Singapore: 128254
Other patents pending
Third Party Software Used
=========================
RTG software uses the open source htsjdk library
(https://github.com/samtools/htsjdk) for reading and writing SAM
files, under the terms of following license:
The MIT License
Copyright (c) 2009 The Broad Institute
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
-------------------------
RTG software uses the bzip2 library included in the open source Ant project
(http://ant.apache.org/) for decompressing bzip2 format files, under the
following license:
Copyright 1999-2010 The Apache Software Foundation
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-------------------------
RTG Software uses a modified version of
java/util/zip/GZIPInputStream.java (available in the accompanying
gzipfix.jar) from OpenJDK 7 under the terms of the following license:
This code is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License version 2 only, as
published by the Free Software Foundation. Oracle designates this
particular file as subject to the "Classpath" exception as provided
by Oracle in the LICENSE file that accompanied this code.
This code is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
version 2 for more details (a copy is included in the LICENSE file that
accompanied this code).
You should have received a copy of the GNU General Public License version
2 along with this work; if not, write to the Free Software Foundation,
Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
or visit http://www.oracle.com if you need additional information or have
any questions.
-------------------------
RTG Software uses hierarchical data visualization software from
http://sourceforge.net/projects/krona/ under the terms of the
following license:
Copyright (c) 2011, Battelle National Biodefense Institute (BNBI);
all rights reserved. Authored by: Brian Ondov, Nicholas Bergman, and
Adam Phillippy
This Software was prepared for the Department of Homeland Security
(DHS) by the Battelle National Biodefense Institute, LLC (BNBI) as
part of contract HSHQDC-07-C-00020 to manage and operate the National
Biodefense Analysis and Countermeasures Center (NBACC), a Federally
Funded Research and Development Center.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the Battelle National Biodefense Institute nor
the names of its contributors may be used to endorse or promote
products derived from this software without specific prior written
permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Notice¶
Real Time Genomics does not assume any liability arising out of the application or use of any software described herein. Further, Real Time Genomics does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others.
Real Time Genomics reserves the right to make any changes in any processes, products, or parts thereof, described herein, without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty of fitness is implied.
© 2017 Real Time Genomics All rights reserved.
Illumina, Solexa, Complete Genomics, Ion Torrent, Roche, ABI, Life Technologies, and PacBio are registered trademarks and all other brands referenced in this document are the property of their respective owners.