CollectSequencingArtifactMetrics (Picard)

Collect metrics to quantify single-base sequencing artifacts.

This tool examines two sources of sequencing errors associated with hybrid selection protocols. These errors are divided into two broad categories: pre-adapter and bait-bias. Pre-adapter errors can arise from laboratory manipulations of a nucleic acid sample (e.g. shearing) and occur prior to the ligation of adapters for PCR amplification (hence the name pre-adapter).

Bait-bias artifacts occur during or after the target selection step, and correlate with substitution rates that are 'biased', that is, higher for sites having one base on the reference/positive strand than for sites having the complementary base on that strand. For example, during the target selection step, a (G>T) artifact might result in a higher substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the opposite arrangement (C on the positive strand, G on the negative). This is known as the 'G-Ref' artifact.
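To make the strand comparison concrete, the sketch below pairs a reference context with its reverse complement, which is how the same site reads from the opposite strand. The helper name `revcomp` is hypothetical and is not part of Picard; it only handles the four unambiguous bases.

```shell
#!/bin/sh
# revcomp is an illustrative helper, not a Picard command.
# It walks the sequence from the last base to the first and emits the
# complement of each base, which yields the reverse complement.
revcomp() {
    printf '%s\n' "$1" | awk '
        BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
        {
            out = ""
            for (i = length($0); i >= 1; i--) out = out comp[substr($0, i, 1)]
            print out
        }'
}

# A site with G on the positive strand has C on the negative strand,
# so the triplet context AGT pairs with the context ACT.
revcomp AGT
```

Comparing the substitution rate in a context against the rate in its reverse complement is what exposes a bias toward one reference base.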

For additional information on these types of artifacts, please see the corresponding GATK dictionary entries on bait-bias and pre-adapter artifacts.

This tool produces four files: summary and detail metrics files for both pre-adapter and bait-bias artifacts. The detail metrics show the error rates for each type of base substitution within every possible triplet base configuration. Error rates associated with these substitutions are Phred-scaled and provided as quality scores; the lower the value, the more likely it is that an alternate base call is due to an artifact. The summary metrics provide likelihood information on the 'worst-case' errors.
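As a reminder of what Phred scaling means, the sketch below converts an error rate into a quality score using the standard definition Q = -10 * log10(error rate). The function name `phred_from_rate` is hypothetical, and any capping or flooring the tool may apply to its reported scores is not modelled here.

```shell
#!/bin/sh
# phred_from_rate is an illustrative helper, not part of Picard.
# awk has no log10, so the natural log is divided by log(10).
# The rate must be greater than zero.
phred_from_rate() {
    awk -v rate="$1" 'BEGIN { printf "%.1f\n", -10 * log(rate) / log(10) }'
}

phred_from_rate 0.001     # one error in 1,000 bases, prints 30.0
phred_from_rate 0.00001   # one error in 100,000 bases, prints 50.0
```

A drop of 10 in the score corresponds to a tenfold higher error rate, which is why low scores point to artifacts.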

Usage example:

java -jar picard.jar CollectSequencingArtifactMetrics \
     I=input.bam \
     O=artifact_metrics.txt \
     R=reference_sequence.fasta

For complete descriptions of the output metrics produced by this tool, see the documentation for PreAdapterDetailMetrics, PreAdapterSummaryMetrics, BaitBiasDetailMetrics, and BaitBiasSummaryMetrics.
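Once the metrics files exist, a quick way to spot problem substitutions is to list the rows whose quality score falls below a threshold. The sketch below assumes the usual Picard metrics layout (comment lines starting with `#`, then a tab-separated header row followed by data rows) and takes the column name as a parameter. The function name `low_qscore_rows` is hypothetical, and the column name `TOTAL_QSCORE` in the usage comment is an assumption to check against the header of your own file.

```shell
#!/bin/sh
# low_qscore_rows FILE COLUMN THRESHOLD
# Illustrative helper, not part of Picard. Prints the header row and every
# data row whose value in COLUMN is below THRESHOLD.
low_qscore_rows() {
    awk -F '\t' -v col="$2" -v max="$3" '
        /^#/ || NF == 0 { next }          # skip comment and blank lines
        !idx {                            # first remaining line holds column names
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col | "cat 1>&2"; exit 1 }
            print
            next
        }
        $idx + 0 < max + 0 { print }      # numeric comparison against the threshold
    ' "$1"
}

# Example call (commented out because it needs a real metrics file):
# low_qscore_rows artifact_metrics.pre_adapter_summary_metrics TOTAL_QSCORE 30
```

Passing the column name rather than a fixed position keeps the helper usable for both summary and detail files, whose columns differ.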

Category: Diagnostics and Quality Control

Overview

Quantify substitution errors caused by mismatched base pairings during various stages of sample and library preparation. We measure two distinct error types: artifacts that are introduced before the addition of the read1/read2 adapters ("pre adapter") and artifacts that are introduced after target selection ("bait bias"). For each of these, we provide summary metrics as well as detail metrics broken down by reference context (the ref bases surrounding the substitution event). For a deeper explanation, see Costello et al. 2013: http://www.ncbi.nlm.nih.gov/pubmed/23303777

CollectSequencingArtifactMetrics (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.

Argument name(s)	Default value	Summary
Required Arguments
--INPUT -I	null	Input SAM or BAM file.
--OUTPUT -O	null	File to write the output to.
--REFERENCE_SEQUENCE -R	null	Reference sequence file.
Optional Tool Arguments
--arguments_file	[]	read one or more arguments files and add them to the command line
--ASSUME_SORTED -AS	true	If true (default), then the sort order in the header file will be ignored.
--CONTEXT_SIZE	1	The number of context bases to include on each side of the assayed base.
--CONTEXTS_TO_PRINT	[]	If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.
--DB_SNP	null	VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.
--FILE_EXTENSION -EXT	null	Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null
--help -h	false	display the help message
--INCLUDE_DUPLICATES -DUPES	false	Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.
--INCLUDE_NON_PF_READS -NON_PF	false	Whether or not to include non-PF reads.
--INCLUDE_UNPAIRED -UNPAIRED	false	Include unpaired reads. If set to true, paired reads are still included, and MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE are ignored.
--INTERVALS	null	An optional list of intervals to restrict analysis to.
--MAXIMUM_INSERT_SIZE -MAX_INS	600	The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.
--MINIMUM_INSERT_SIZE -MIN_INS	60	The minimum insert size for a read to be included in analysis.
--MINIMUM_MAPPING_QUALITY -MQ	30	The minimum mapping quality score for a base to be included in analysis.
--MINIMUM_QUALITY_SCORE -Q	20	The minimum base quality score for a base to be included in analysis.
--STOP_AFTER	0	Stop after processing N reads, mainly for debugging.
--TANDEM_READS -TANDEM	false	Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.
--USE_OQ	true	When available, use original quality scores for filtering.
--version	false	display the version number for this tool
Optional Common Arguments
--COMPRESSION_LEVEL	5	Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX	false	Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE	false	Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS	client_secrets.json	Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM	500000	When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET	false	Whether to suppress job-summary info on System.err.
--TMP_DIR	[]	One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater	false	Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater	false	Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY	STRICT	Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY	INFO	Control verbosity of logging.
Advanced Arguments
--showHidden	false	display hidden arguments
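
The arguments in the table can be combined in a single call. The sketch below wraps the tool in a small shell function that takes the three required arguments and passes any optional ones through unchanged. It assumes the `gatk` launcher script is on the PATH; the function name `collect_artifact_metrics` and all file names are placeholders.

```shell
#!/bin/sh
# collect_artifact_metrics INPUT OUTPUT_BASE REFERENCE [optional arguments...]
# Illustrative wrapper; the function name and file names are placeholders.
collect_artifact_metrics() {
    if [ "$#" -lt 3 ]; then
        echo "usage: collect_artifact_metrics INPUT OUTPUT_BASE REFERENCE [args...]" >&2
        return 1
    fi
    input_bam=$1
    output_base=$2
    reference=$3
    shift 3
    # Remaining arguments are handed to the tool exactly as given.
    gatk CollectSequencingArtifactMetrics \
        --INPUT "$input_bam" \
        --OUTPUT "$output_base" \
        --REFERENCE_SEQUENCE "$reference" \
        "$@"
}

# Example call (commented out because it needs real files):
# collect_artifact_metrics input.bam artifact_metrics reference_sequence.fasta \
#     --DB_SNP dbsnp.vcf \
#     --INTERVALS targets.interval_list \
#     --MINIMUM_MAPPING_QUALITY 40
```

Quoting each variable keeps paths that contain spaces intact as single arguments.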

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--arguments_file / NA

read one or more arguments files and add them to the command line

List[File] []

--ASSUME_SORTED / -AS

If true (default), then the sort order in the header file will be ignored.

boolean true

--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int 5 [ [ -∞ ∞ ] ]

--CONTEXT_SIZE / NA

The number of context bases to include on each side of the assayed base.

int 1 [ [ -∞ ∞ ] ]
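
To see how quickly the number of possible contexts grows with this setting, the sketch below counts the distinct sequences of length 2 * CONTEXT_SIZE + 1 over the four bases. The function name `context_count` is hypothetical, and the result is plain arithmetic rather than a statement about how many rows the tool writes.

```shell
#!/bin/sh
# context_count N
# Illustrative helper, not part of Picard. A context has N bases on each side
# of the assayed base, so its length is 2N + 1 and there are 4^(2N + 1)
# possible sequences.
context_count() {
    length=$((2 * $1 + 1))
    total=1
    i=0
    while [ "$i" -lt "$length" ]; do
        total=$((total * 4))
        i=$((i + 1))
    done
    echo "$total"
}

context_count 1   # default setting (triplets), prints 64
context_count 2   # five-base contexts, prints 1024
```

Larger contexts split the same reads across many more bins, so each bin has fewer observations behind its error rate.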

--CONTEXTS_TO_PRINT / NA

If specified, only print results for these contexts in the detail metrics output. However, the summary metrics output will still take all contexts into consideration.

Set[String] []

--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean false

--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean false

--DB_SNP / NA

VCF format dbSNP file, used to exclude regions around known polymorphisms from analysis.

File null

--FILE_EXTENSION / -EXT

Append the given file extension to all metric file names (ex. OUTPUT.pre_adapter_summary_metrics.EXT). None if null

String null
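
The sketch below lists the file names to expect for a given OUTPUT value, following the pattern in the example above. Only the `pre_adapter_summary_metrics` suffix appears in this documentation; the other three suffixes are assumed by analogy with the metric names. The sketch also assumes the extension is supplied with its leading dot and appended verbatim, which is worth checking against the files the tool actually writes. The function name `expected_outputs` is hypothetical.

```shell
#!/bin/sh
# expected_outputs OUTPUT_BASE [EXTENSION]
# Illustrative helper, not part of Picard. Suffixes other than
# pre_adapter_summary_metrics are assumptions.
expected_outputs() {
    base=$1
    ext=${2:-}    # empty when no extension is given
    for kind in pre_adapter_summary_metrics pre_adapter_detail_metrics \
                bait_bias_summary_metrics bait_bias_detail_metrics; do
        printf '%s.%s%s\n' "$base" "$kind" "$ext"
    done
}

expected_outputs artifact_metrics .txt
```

A list like this is handy in a pipeline step that checks all four files exist before moving on.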

--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String client_secrets.json

--help / -h

display the help message

boolean false

--INCLUDE_DUPLICATES / -DUPES

Include duplicate reads. If set to true then all reads flagged as duplicates will be included as well.

boolean false

--INCLUDE_NON_PF_READS / -NON_PF

Whether or not to include non-PF reads.

boolean false

--INCLUDE_UNPAIRED / -UNPAIRED

Include unpaired reads. If set to true, paired reads are still included, and MINIMUM_INSERT_SIZE and MAXIMUM_INSERT_SIZE are ignored.

boolean false

--INPUT / -I

Input SAM or BAM file.

File null (required)

--INTERVALS / NA

An optional list of intervals to restrict analysis to.

File null

--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer 500000 [ [ -∞ ∞ ] ]

--MAXIMUM_INSERT_SIZE / -MAX_INS

The maximum insert size for a read to be included in analysis. Set to 0 to have no maximum.

int 600 [ [ -∞ ∞ ] ]

--MINIMUM_INSERT_SIZE / -MIN_INS

The minimum insert size for a read to be included in analysis.

int 60 [ [ -∞ ∞ ] ]

--MINIMUM_MAPPING_QUALITY / -MQ

The minimum mapping quality score for a base to be included in analysis.

int 30 [ [ -∞ ∞ ] ]

--MINIMUM_QUALITY_SCORE / -Q

The minimum base quality score for a base to be included in analysis.

int 20 [ [ -∞ ∞ ] ]

--OUTPUT / -O

File to write the output to.

File null (required)

--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean false

--REFERENCE_SEQUENCE / -R

Reference sequence file.

File null (required)

--showHidden / -showHidden

display hidden arguments

boolean false

--STOP_AFTER / NA

Stop after processing N reads, mainly for debugging.

long 0 [ [ -∞ ∞ ] ]

--TANDEM_READS / -TANDEM

Set to true if mate pairs are being sequenced from the same strand, i.e. they're expected to face the same direction.

boolean false

--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File] []

--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean false

--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean false

--USE_OQ / NA

When available, use original quality scores for filtering.

boolean true

--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency STRICT

--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel INFO

--version / NA

display the version number for this tool

boolean false

GATK version 4.0.11.0 built at 23-11-2018 02:11:49.