Identifies duplicate reads, accounting for mate CIGAR. This tool locates and tags duplicate reads (both PCR and optical) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA, taking into account the CIGAR string of read mates.
It is intended as an improvement upon the original MarkDuplicates algorithm, from which it differs in several ways, including differences in how it breaks ties. It may be the most effective duplicate marking program available, as it handles all cases, including clipped and gapped alignments, and locates duplicate molecules using mate CIGAR information. However, please note that it is not yet used in the Broad's production pipeline, so use it at your own risk.
Note also that this tool will not work with alignments that have large gaps or deletions, such as those from RNA-seq data. This is due to the need to buffer small genomic windows to ensure integrity of the duplicate marking, while large skips (e.g., skipping introns) in the alignment records would force making that window very large, thus exhausting memory.
Note: Metrics labeled as percentages are actually expressed as fractions! For example, a reported value of 0.15 corresponds to 15%.
java -jar picard.jar MarkDuplicatesWithMateCigar \
     I=input.bam \
     O=mark_dups_w_mate_cig.bam \
     M=mark_dups_w_mate_cig_metrics.txt
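The same run can be extended with any of the optional arguments described below. For instance, a sketch (file names are placeholders) that also drops duplicate records from the output and writes a BAM index:

java -jar picard.jar MarkDuplicatesWithMateCigar \
     I=input.bam \
     O=mark_dups_w_mate_cig.bam \
     M=mark_dups_w_mate_cig_metrics.txt \
     REMOVE_DUPLICATES=true \
     CREATE_INDEX=true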
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
Argument name(s) | Default value | Summary
---|---|---
Required Arguments | |
--INPUT -I | [] | One or more input SAM or BAM files to analyze. Must be coordinate sorted.
--METRICS_FILE -M | null | File to write duplication metrics to.
--OUTPUT -O | null | The output file to write marked records to.
Optional Tool Arguments | |
--arguments_file | [] | Read one or more arguments files and add them to the command line.
--ASSUME_SORT_ORDER -ASO | null | If not null, assume that the input file has this order even if the header says otherwise.
--BLOCK_SIZE | 100000 | The block size for use in the coordinate-sorted record buffer.
--COMMENT -CO | [] | Comment(s) to include in the output file's header.
--DUPLICATE_SCORING_STRATEGY -DS | TOTAL_MAPPED_REFERENCE_LENGTH | The scoring strategy for choosing the non-duplicate among candidates.
--help -h | false | Display the help message.
--MAX_OPTICAL_DUPLICATE_SET_SIZE | 300000 | This number is the maximum size of a set of duplicate reads for which we will attempt to determine which are optical duplicates. Please be aware that if you raise this value too high and do encounter a very large set of duplicate reads, it will severely affect the runtime of this tool. To completely disable this check, set the value to -1.
--MINIMUM_DISTANCE | -1 | The minimum distance to buffer records to account for clipping on the 5' end of the records. For a given alignment, this parameter controls the width of the window to search for duplicates of that alignment. Due to 5' read clipping, duplicates do not necessarily have the same 5' alignment coordinates, so the algorithm needs to search around the neighborhood. For single-end sequencing data, the neighborhood is determined only by the amount of clipping (assuming no split reads), so setting MINIMUM_DISTANCE to twice the sequencing read length should be sufficient. For paired-end sequencing, the neighborhood is also determined by the fragment insert size, so you may want to set MINIMUM_DISTANCE to something like twice the 99.5th percentile of the fragment insert size distribution (see CollectInsertSizeMetrics). Alternatively, set this value to -1 to use either a) twice the first read's read length, or b) 100, whichever is smaller. Note that the larger the window, the greater the RAM requirements, so you could run into performance limitations if you use a value that is unnecessarily large.
--OPTICAL_DUPLICATE_PIXEL_DISTANCE | 100 | The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.
--PROGRAM_GROUP_COMMAND_LINE -PG_COMMAND | null | Value of CL tag of PG record to be created. If not supplied, the command line will be detected automatically.
--PROGRAM_GROUP_NAME -PG_NAME | MarkDuplicatesWithMateCigar | Value of PN tag of PG record to be created.
--PROGRAM_GROUP_VERSION -PG_VERSION | null | Value of VN tag of PG record to be created. If not specified, the version will be detected automatically.
--PROGRAM_RECORD_ID -PG | MarkDuplicates | The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs.
--READ_NAME_REGEX | | MarkDuplicates can use the tile and cluster positions to estimate the rate of optical duplication, in addition to the dominant source of duplication, PCR, to provide a more accurate estimation of library size. By default (with no READ_NAME_REGEX specified), MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below). Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without optical duplicate counts, library size estimation will be less accurate. If the read name does not follow a standard Illumina colon-separation convention but does contain tile and x,y coordinates, a regular expression can be specified to extract three variables from a read name: tile/region, x coordinate, and y coordinate. The regular expression must contain three capture groups for the three variables, in order, and must match the entire read name. For example, if field names were separated by semicolons (';'), this regex could be specified: (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no READ_NAME_REGEX is specified, the read name is split on ':'. For 5-element names, the 3rd, 4th, and 5th elements are assumed to be tile, x, and y values. For 7-element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x, and y values.
--REMOVE_DUPLICATES | false | If true, do not write duplicates to the output file; if false, write them with the appropriate flags set.
--SKIP_PAIRS_WITH_NO_MATE_CIGAR | true | Skip duplicate marking for record pairs with no mate CIGAR, but still include them in the output.
--version | false | Display the version number for this tool.
Optional Common Arguments | |
--ADD_PG_TAG_TO_READS | true | Add PG tag to each read in a SAM or BAM.
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS | client_secrets.json | Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE -R | null | Reference sequence file.
--TMP_DIR | [] | One or more directories with space available to be used by this program for temporary storage of working files.
--USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output.
--USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input.
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.
Advanced Arguments | |
--showHidden | false | Display hidden arguments.
Deprecated Arguments | |
--ASSUME_SORTED -AS | false | If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated: use ASSUME_SORT_ORDER=coordinate instead.
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--ADD_PG_TAG_TO_READS
Add PG tag to each read in a SAM or BAM.
Boolean  true

--arguments_file
Read one or more arguments files and add them to the command line.
List[File]  []

--ASSUME_SORT_ORDER -ASO
If not null, assume that the input file has this order even if the header says otherwise.
Exclusion: This argument cannot be used at the same time as ASSUME_SORTED.
The --ASSUME_SORT_ORDER argument is an enumerated type (SortOrder), which can have one of the following values: unsorted, queryname, coordinate, duplicate, unknown.
SortOrder  null

--ASSUME_SORTED -AS
If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated: use ASSUME_SORT_ORDER=coordinate instead.
Exclusion: This argument cannot be used at the same time as ASSUME_SORT_ORDER (-ASO).
boolean  false
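For example, a legacy invocation that relied on the deprecated flag can be updated as follows (file names are placeholders):

# Deprecated:
java -jar picard.jar MarkDuplicatesWithMateCigar I=input.bam O=marked.bam M=metrics.txt ASSUME_SORTED=true
# Preferred:
java -jar picard.jar MarkDuplicatesWithMateCigar I=input.bam O=marked.bam M=metrics.txt ASSUME_SORT_ORDER=coordinate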
--BLOCK_SIZE
The block size for use in the coordinate-sorted record buffer.
int  100000  [ [ -∞ ∞ ] ]

--COMMENT -CO
Comment(s) to include in the output file's header.
List[String]  []

--COMPRESSION_LEVEL
Compression level for all compressed files created (e.g. BAM and VCF).
int  5  [ [ -∞ ∞ ] ]

--CREATE_INDEX
Whether to create a BAM index when writing a coordinate-sorted BAM file.
Boolean  false

--CREATE_MD5_FILE
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean  false

--DUPLICATE_SCORING_STRATEGY -DS
The scoring strategy for choosing the non-duplicate among candidates.
The --DUPLICATE_SCORING_STRATEGY argument is an enumerated type (ScoringStrategy), which can have one of the following values: SUM_OF_BASE_QUALITIES, TOTAL_MAPPED_REFERENCE_LENGTH, RANDOM.
ScoringStrategy  TOTAL_MAPPED_REFERENCE_LENGTH
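For example, to select the representative (non-duplicate) read by summed base quality rather than by total mapped reference length, a sketch (file names are placeholders) might be:

java -jar picard.jar MarkDuplicatesWithMateCigar \
     I=input.bam \
     O=mark_dups_w_mate_cig.bam \
     M=mark_dups_w_mate_cig_metrics.txt \
     DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES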
--GA4GH_CLIENT_SECRETS
Google Genomics API client_secrets.json file path.
String  client_secrets.json

--help -h
Display the help message.
boolean  false

--INPUT -I
One or more input SAM or BAM files to analyze. Must be coordinate sorted.
Required.  List[String]  []

--MAX_OPTICAL_DUPLICATE_SET_SIZE
This number is the maximum size of a set of duplicate reads for which we will attempt to determine which are optical duplicates. Please be aware that if you raise this value too high and do encounter a very large set of duplicate reads, it will severely affect the runtime of this tool. To completely disable this check, set the value to -1.
long  300000  [ [ -∞ ∞ ] ]

--MAX_RECORDS_IN_RAM
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer  500000  [ [ -∞ ∞ ] ]

--METRICS_FILE -M
File to write duplication metrics to.
Required.  File  null

--MINIMUM_DISTANCE
The minimum distance to buffer records to account for clipping on the 5' end of the records. For a given alignment, this parameter controls the width of the window to search for duplicates of that alignment. Due to 5' read clipping, duplicates do not necessarily have the same 5' alignment coordinates, so the algorithm needs to search around the neighborhood. For single-end sequencing data, the neighborhood is determined only by the amount of clipping (assuming no split reads), so setting MINIMUM_DISTANCE to twice the sequencing read length should be sufficient. For paired-end sequencing, the neighborhood is also determined by the fragment insert size, so you may want to set MINIMUM_DISTANCE to something like twice the 99.5th percentile of the fragment insert size distribution (see CollectInsertSizeMetrics), as in the example below. Alternatively, set this value to -1 to use either a) twice the first read's read length, or b) 100, whichever is smaller. Note that the larger the window, the greater the RAM requirements, so you could run into performance limitations if you use a value that is unnecessarily large.
int  -1  [ [ -∞ ∞ ] ]
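As a sketch of the paired-end recommendation above (file names are placeholders, and the value 600 assumes a 99.5th percentile insert size of roughly 300 bp; derive the actual number from your own data):

java -jar picard.jar CollectInsertSizeMetrics \
     I=input.bam \
     O=insert_size_metrics.txt \
     H=insert_size_histogram.pdf

java -jar picard.jar MarkDuplicatesWithMateCigar \
     I=input.bam \
     O=mark_dups_w_mate_cig.bam \
     M=mark_dups_w_mate_cig_metrics.txt \
     MINIMUM_DISTANCE=600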
--OPTICAL_DUPLICATE_PIXEL_DISTANCE
The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.
int  100  [ [ -∞ ∞ ] ]
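For example, on data from a patterned flowcell model, a sketch (file names are placeholders) might be:

java -jar picard.jar MarkDuplicatesWithMateCigar \
     I=input.bam \
     O=mark_dups_w_mate_cig.bam \
     M=mark_dups_w_mate_cig_metrics.txt \
     OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500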
--OUTPUT -O
The output file to write marked records to.
Required.  File  null

--PROGRAM_GROUP_COMMAND_LINE -PG_COMMAND
Value of CL tag of PG record to be created. If not supplied, the command line will be detected automatically.
String  null

--PROGRAM_GROUP_NAME -PG_NAME
Value of PN tag of PG record to be created.
String  MarkDuplicatesWithMateCigar

--PROGRAM_GROUP_VERSION -PG_VERSION
Value of VN tag of PG record to be created. If not specified, the version will be detected automatically.
String  null

--PROGRAM_RECORD_ID -PG
The program record ID for the @PG record(s) created by this program. Set to null to disable PG record creation. This string may have a suffix appended to avoid collision with other program record IDs.
String  MarkDuplicates

--QUIET
Whether to suppress job-summary info on System.err.
Boolean  false

--READ_NAME_REGEX
MarkDuplicates can use the tile and cluster positions to estimate the rate of optical duplication, in addition to the dominant source of duplication, PCR, to provide a more accurate estimation of library size. By default (with no READ_NAME_REGEX specified), MarkDuplicates will attempt to extract coordinates using a split on ':' (see Note below). Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without optical duplicate counts, library size estimation will be less accurate.
If the read name does not follow a standard Illumina colon-separation convention but does contain tile and x,y coordinates, a regular expression can be specified to extract three variables from a read name: tile/region, x coordinate, and y coordinate. The regular expression must contain three capture groups for the three variables, in order, and must match the entire read name. For example, if field names were separated by semicolons (';'), this regex could be specified: (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$
Note that if no READ_NAME_REGEX is specified, the read name is split on ':'. For 5-element names, the 3rd, 4th, and 5th elements are assumed to be tile, x, and y values. For 7-element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x, and y values.
String
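For example, to disable optical duplicate detection entirely, or to apply the semicolon-separated pattern described above, a sketch (file names are placeholders; quote the regex so the shell does not interpret it):

# Disable optical duplicate detection:
java -jar picard.jar MarkDuplicatesWithMateCigar I=input.bam O=marked.bam M=metrics.txt READ_NAME_REGEX=null

# Custom regex for semicolon-separated read names:
java -jar picard.jar MarkDuplicatesWithMateCigar I=input.bam O=marked.bam M=metrics.txt \
     READ_NAME_REGEX='(?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$'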
--REFERENCE_SEQUENCE -R
Reference sequence file.
File  null

--REMOVE_DUPLICATES
If true, do not write duplicates to the output file; if false, write them with the appropriate flags set.
boolean  false

--showHidden
Display hidden arguments.
boolean  false

--SKIP_PAIRS_WITH_NO_MATE_CIGAR
Skip duplicate marking for record pairs with no mate CIGAR, but still include them in the output.
boolean  true

--TMP_DIR
One or more directories with space available to be used by this program for temporary storage of working files.
List[File]  []

--USE_JDK_DEFLATER -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output.
Boolean  false

--USE_JDK_INFLATER -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input.
Boolean  false

--VALIDATION_STRINGENCY
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values: STRICT, LENIENT, SILENT.
ValidationStringency  STRICT

--VERBOSITY
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values: ERROR, WARNING, INFO, DEBUG.
LogLevel  INFO

--version
Display the version number for this tool.
boolean  false
See also: General Documentation | Tool Documentation Index | Support Forum
GATK version 4.0.11.0 built at 23-11-2018 02:11:49.