Collect metrics to quantify single-base sequencing artifacts.
This tool examines two sources of sequencing errors associated with hybrid selection protocols. These errors fall into two broad categories: pre-adapter and bait-bias. Pre-adapter errors can arise from laboratory manipulations of a nucleic acid sample (e.g., shearing) and occur prior to the ligation of adapters for PCR amplification, hence the name pre-adapter.
Bait-bias artifacts occur during or after the target selection step and correlate with substitution rates that are 'biased', that is, higher for sites having one base on the reference/positive strand than for sites having the complementary base on that strand. For example, during the target selection step, a (G>T) artifact might result in a higher substitution rate at sites with a G on the positive strand (and C on the negative) than at sites with the reverse arrangement (C positive, G negative). This is known as the 'G-Ref' artifact.
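The comparison behind a bait-bias artifact can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: the function names and the counts are hypothetical, and the sketch assumes the bias is summarized as the difference between the substitution rate at sites with the reference base on the positive strand and the rate at sites with the complementary base there.

```python
def substitution_rate(alt_count, ref_count):
    """Fraction of observed bases that disagree with the reference."""
    total = alt_count + ref_count
    return alt_count / total if total else 0.0


def strand_bias(fwd_alt, fwd_ref, rev_alt, rev_ref):
    """Excess substitution rate at 'forward' sites over 'reverse' sites.

    'Forward' sites carry the reference base (e.g. G) on the positive strand;
    'reverse' sites carry its complement (C) there. A clearly positive value
    suggests a bias toward the forward arrangement. (Hypothetical helper.)
    """
    fwd_rate = substitution_rate(fwd_alt, fwd_ref)
    rev_rate = substitution_rate(rev_alt, rev_ref)
    return fwd_rate - rev_rate


# Made-up counts for a G>T substitution: 0.5% at G-positive sites versus
# 0.1% at C-positive sites, so the excess is roughly 0.004.
g_ref_bias = strand_bias(fwd_alt=50, fwd_ref=9950, rev_alt=10, rev_ref=9990)
print(f"G-Ref excess substitution rate: {g_ref_bias:.4f}")
```

A value near zero would indicate that both arrangements show similar substitution rates, which is what one would expect in the absence of this kind of artifact.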
For additional information on these types of artifacts, please see the corresponding GATK dictionary entries on bait-bias and pre-adapter artifacts.
This tool produces four files: summary and detail metrics files for both pre-adapter and bait-bias artifacts. The detailed metrics show the error rates for each type of base substitution within every possible triplet base configuration. Error rates associated with these substitutions are Phred-scaled and provided as quality scores; the lower the value, the more likely it is that an alternate base call is due to an artifact. The summary metrics provide likelihood information on the 'worst-case' errors.
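To make the triplet contexts and the Phred scaling concrete, here is a short Python sketch. It uses the standard Phred definition, Q = -10 * log10(error rate); the function names and the `floor` parameter (a guard against taking the logarithm of zero) are assumptions made for this illustration and are not taken from the tool.

```python
import itertools
import math

BASES = "ACGT"


def contexts(context_size=1):
    """All sequence contexts with `context_size` bases on each side of the assayed base."""
    length = 2 * context_size + 1
    return ["".join(p) for p in itertools.product(BASES, repeat=length)]


def phred(error_rate, floor=1e-10):
    """Phred-scale an error rate: Q = -10 * log10(rate).

    `floor` is an assumed lower bound that keeps log10 defined when the
    observed error rate is zero.
    """
    return -10.0 * math.log10(max(error_rate, floor))


# One flanking base per side gives 4**3 = 64 triplet contexts.
print(len(contexts(1)))

# An error rate of 1 in 1000 corresponds to a quality score of about 30;
# a higher error rate yields a lower score.
print(round(phred(0.001)), round(phred(0.01)))
```

Under this scaling, a drop of 10 in the quality score corresponds to a tenfold increase in the estimated error rate.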
java -jar picard.jar CollectSequencingArtifactMetrics \
     I=input.bam \
     O=artifact_metrics.txt \
     R=reference_sequence.fasta
For complete descriptions of the output metrics produced by this tool, see the metric definitions for PreAdapterDetailMetrics, PreAdapterSummaryMetrics, BaitBiasDetailMetrics, and BaitBiasSummaryMetrics.
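Because the OUTPUT value acts as a base name for four files, it can help to work out the expected file names before running downstream steps. The Python sketch below is illustrative only. The `pre_adapter_summary_metrics` suffix appears in the FILE_EXTENSION description; the other three suffixes are assumed by analogy, and the sketch also assumes that FILE_EXTENSION is appended verbatim, so it would need to include its own leading dot. The helper name is hypothetical.

```python
def expected_outputs(output_base, file_extension=None):
    """List the metrics file names expected for a given OUTPUT base name.

    Suffixes other than '.pre_adapter_summary_metrics' are assumptions made
    by analogy; check them against the files your run actually produces.
    """
    suffixes = [
        ".pre_adapter_detail_metrics",
        ".pre_adapter_summary_metrics",
        ".bait_bias_detail_metrics",
        ".bait_bias_summary_metrics",
    ]
    # Assumed behavior: the extension is appended as given, or omitted if None.
    ext = "" if file_extension is None else file_extension
    return [output_base + suffix + ext for suffix in suffixes]


for name in expected_outputs("artifact_metrics.txt"):
    print(name)
```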
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--INPUT -I | null | Input SAM or BAM file.
--OUTPUT -O | null | File to write the output to.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
Optional Tool Arguments | |
--arguments_file | [] | read one or more arguments files and add them to the command line
--ASSUME_SORTED -AS | true | If true (default), then the sort order in the header file will be ignored.
--CONTEXT_SIZE | 1 | The number of context bases to include on each side of the assayed base.
--CONTEXTS_TO_PRINT | [] | If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.
--DB_SNP | null | VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.
--FILE_EXTENSION -EXT | null | Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null
--help -h | false | display the help message
--INCLUDE_DUPLICATES -DUPES | false | Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.
--INCLUDE_NON_PF_READS -NON_PF | false | Whether or not to include non-PF reads.
--INCLUDE_UNPAIRED -UNPAIRED | false | Include unpaired reads. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored.
--INTERVALS | null | An optional list of intervals to restrict analysis to.
--MAXIMUM_INSERT_SIZE -MAX_INS | 600 | The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.
--MINIMUM_INSERT_SIZE -MIN_INS | 60 | The minimum insert size for a read to be included in analysis.
--MINIMUM_MAPPING_QUALITY -MQ | 30 | The minimum mapping quality score for a base to be included in analysis.
--MINIMUM_QUALITY_SCORE -Q | 20 | The minimum base quality score for a base to be included in analysis.
--STOP_AFTER | 0 | Stop after processing N reads, mainly for debugging.
--TANDEM_READS -TANDEM | false | Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.
--USE_OQ | true | When available, use original quality scores for filtering.
--version | false | display the version number for this tool
Optional Common Arguments | |
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | display hidden arguments
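How the read-level filtering arguments in the table interact can be sketched as a single predicate. This Python example is a simplified reading of the argument summaries, not the tool's actual logic: the `Read` class and `read_passes` function are hypothetical, the thresholds are assumed to be inclusive, and the per-base MINIMUM_QUALITY_SCORE check is left out.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical, simplified read record; not a Picard or htsjdk type."""
    mapping_quality: int
    insert_size: int
    is_paired: bool = True
    is_duplicate: bool = False
    fails_vendor_qc: bool = False  # True for a non-PF read


def read_passes(read, min_mq=30, min_ins=60, max_ins=600,
                include_duplicates=False, include_non_pf=False,
                include_unpaired=False):
    """Return True if the read would be kept under the given settings."""
    if read.is_duplicate and not include_duplicates:
        return False
    if read.fails_vendor_qc and not include_non_pf:
        return False
    if read.mapping_quality < min_mq:
        return False
    if include_unpaired:
        # Per the INCLUDE_UNPAIRED summary, insert size limits are ignored.
        return True
    if not read.is_paired:
        return False
    size = abs(read.insert_size)
    if size < min_ins:
        return False
    # A maximum of 0 means "no maximum".
    if max_ins != 0 and size > max_ins:
        return False
    return True


print(read_passes(Read(mapping_quality=60, insert_size=300)))
print(read_passes(Read(mapping_quality=60, insert_size=1000), max_ins=0))
```

The default values in the signature mirror the defaults listed in the table, so calling the predicate without overrides approximates a run with no optional arguments set.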
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--arguments_file
read one or more arguments files and add them to the command line
List[File] []

--ASSUME_SORTED / -AS
If true (default), then the sort order in the header file will be ignored.
boolean true

--COMPRESSION_LEVEL
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]

--CONTEXT_SIZE
The number of context bases to include on each side of the assayed base.
int 1 [ [ -∞ ∞ ] ]

--CONTEXTS_TO_PRINT
If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.
Set[String] []

--CREATE_INDEX
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean false

--CREATE_MD5_FILE
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false

--DB_SNP
VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.
File null

--FILE_EXTENSION / -EXT
Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null
String null

--GA4GH_CLIENT_SECRETS
Google Genomics API client_secrets.json file path.
String client_secrets.json

--help / -h
display the help message
boolean false

--INCLUDE_DUPLICATES / -DUPES
Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.
boolean false

--INCLUDE_NON_PF_READS / -NON_PF
Whether or not to include non-PF reads.
boolean false

--INCLUDE_UNPAIRED / -UNPAIRED
Include unpaired reads. If set to true then all paired reads will be included as well - MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE will be ignored.
boolean false

--INPUT / -I
Input SAM or BAM file.
R File null

--INTERVALS
An optional list of intervals to restrict analysis to.
File null

--MAX_RECORDS_IN_RAM
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]

--MAXIMUM_INSERT_SIZE / -MAX_INS
The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.
int 600 [ [ -∞ ∞ ] ]

--MINIMUM_INSERT_SIZE / -MIN_INS
The minimum insert size for a read to be included in analysis.
int 60 [ [ -∞ ∞ ] ]

--MINIMUM_MAPPING_QUALITY / -MQ
The minimum mapping quality score for a base to be included in analysis.
int 30 [ [ -∞ ∞ ] ]

--MINIMUM_QUALITY_SCORE / -Q
The minimum base quality score for a base to be included in analysis.
int 20 [ [ -∞ ∞ ] ]

--OUTPUT / -O
File to write the output to.
R File null

--QUIET
Whether to suppress job-summary info on System.err.
Boolean false

--REFERENCE_SEQUENCE / -R
Reference sequence file.
R File null

--showHidden
display hidden arguments
boolean false

--STOP_AFTER
Stop after processing N reads, mainly for debugging.
long 0 [ [ -∞ ∞ ] ]

--TANDEM_READS / -TANDEM
Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.
boolean false

--TMP_DIR
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []

--USE_JDK_DEFLATER / -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false

--USE_JDK_INFLATER / -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false

--USE_OQ
When available, use original quality scores for filtering.
boolean true

--VALIDATION_STRINGENCY
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
ValidationStringency STRICT

--VERBOSITY
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
LogLevel INFO

--version
display the version number for this tool
boolean false
GATK version 4.0.11.0 built at 23-11-2018 02:11:49.