EstimateLibraryComplexity (Picard)

Estimates the numbers of unique molecules in a sequencing library.

This tool outputs quality metrics for a sequencing library preparation. Library complexity refers to the number of unique DNA fragments present in a given library. Reductions in complexity resulting from PCR amplification during library preparation will ultimately compromise downstream analyses by elevating the number of duplicate reads. PCR-associated duplication artifacts can result from inadequate amounts of starting material (genomic DNA, cDNA, etc.), losses during cleanups, and size-selection issues. Duplicate reads can also arise as optical duplicates, which result from sequencing-machine optical sensor artifacts.

This tool attempts to estimate library complexity from the sequence of read pairs alone. Reads are sorted by the first N bases (5 by default) of the first read and then by the first N bases of the second read of a pair. Read pairs are considered to be duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default). Reads of poor quality are filtered out to provide a more accurate estimate: a read pair is excluded if either the first or second read has a mean base quality below MIN_MEAN_QUALITY (20 by default). Unpaired reads are ignored in this computation.
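
The comparison rule and the quality filter can be pictured with a minimal sketch. This is an illustration only, not Picard's actual implementation; the class and method names are invented:

// Illustrative sketch of the rules above; not Picard's implementation.
public final class DuplicateRuleSketch {

    static final double MAX_DIFF_RATE = 0.03;   // default
    static final int MIN_MEAN_QUALITY = 20;     // default

    // Reads must match "with no gaps": compare base-by-base and count mismatches.
    static boolean isDuplicate(byte[] basesA, byte[] basesB) {
        int length = Math.min(basesA.length, basesB.length);
        if (length == 0) return false;
        int mismatches = 0;
        for (int i = 0; i < length; i++) {
            if (basesA[i] != basesB[i]) mismatches++;
        }
        return (double) mismatches / length <= MAX_DIFF_RATE;
    }

    // A read pair is analyzed only if each read's mean base quality reaches the threshold.
    static boolean passesQualityFilter(byte[] qualsRead1, byte[] qualsRead2) {
        return meanQuality(qualsRead1) >= MIN_MEAN_QUALITY
            && meanQuality(qualsRead2) >= MIN_MEAN_QUALITY;
    }

    private static double meanQuality(byte[] quals) {
        if (quals.length == 0) return 0;
        int sum = 0;
        for (byte q : quals) sum += q;
        return (double) sum / quals.length;
    }
}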

The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes them from the calculation of library size. Also, since no alignment information is used in this algorithm, an additional filter is applied to the data as follows. After examining all reads, a histogram is built that maps the number of reads in a duplicate set to the number of duplicate sets of that size. All bins that contain exactly one duplicate set are then removed from the histogram as outliers prior to the library size estimation.
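
To make that outlier filter concrete, here is a hypothetical sketch (class and method names invented for illustration) of how duplicate-set sizes could be tallied and singleton bins dropped:

import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the outlier filter described above; not the tool's implementation.
class DuplicateSetHistogramSketch {

    // Build a histogram of [reads per duplicate set -> number of sets of that size],
    // then drop every bin that contains exactly one duplicate set.
    static Map<Integer, Long> buildFilteredHistogram(int[] duplicateSetSizes) {
        Map<Integer, Long> histogram = new TreeMap<>();
        for (int setSize : duplicateSetSizes) {
            histogram.merge(setSize, 1L, Long::sum);
        }
        histogram.values().removeIf(numberOfSets -> numberOfSets == 1L);
        return histogram;
    }
}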

Usage example:

java -jar picard.jar EstimateLibraryComplexity \
     I=input.bam \
     O=est_lib_complex_metrics.txt

Please see the documentation for the companion MarkDuplicates tool.

Category Diagnostics and Quality Control


Overview

Attempts to estimate library complexity from sequence alone. It does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads whose first N bases are identical to identify duplicates. Reads are considered to be duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).

Reads of poor quality are filtered out so as to provide a more accurate estimate. The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than MIN_MEAN_QUALITY across either the first or second read.

The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes them from the calculation of library size. Also, since there is no alignment information to screen out technical reads, one further filter is applied to the data. After examining all reads, a histogram is built of [# reads in duplicate set -> # of duplicate sets]; all bins that contain exactly one duplicate set are then removed from the histogram as outliers before the library size is estimated.

EstimateLibraryComplexity (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) Default value Summary
Required Arguments
--INPUT
 -I
[] One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.
--OUTPUT
 -O
null Output file to write per-library metrics to.
Optional Tool Arguments
--arguments_file
[] read one or more arguments files and add them to the command line
--BARCODE_TAG
null Barcode SAM tag (ex. BC for 10X Genomics)
--help
 -h
false display the help message
--MAX_DIFF_RATE
0.03 The maximum rate of differences between two reads to call them identical.
--MAX_GROUP_RATIO
500 Do not process self-similar groups that are this many times over the mean expected group size. For example, if the input contains 10m read pairs and MIN_IDENTICAL_BASES is set to 5, then the mean expected group size would be approximately 10 reads.
--MAX_OPTICAL_DUPLICATE_SET_SIZE
300000 This number is the maximum size of a set of duplicate reads for which we will attempt to determine which are optical duplicates. Please be aware that if you raise this value too high and do encounter a very large set of duplicate reads, it will severely affect the runtime of this tool. To completely disable this check, set the value to -1.
--MAX_READ_LENGTH
0 The maximum number of bases to consider when comparing reads (0 means no maximum).
--MIN_GROUP_COUNT
2 Minimum group count. On a per-library basis, we count the number of groups of duplicates that have a particular size. Any count less than this value is omitted from consideration. For example, if we see only one group of duplicates with size 500, we omit it from the metric calculations if MIN_GROUP_COUNT is set to two. Setting this to two may help remove technical artifacts, such as adapter dimers, from the library size calculation.
--MIN_IDENTICAL_BASES
5 The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.
--MIN_MEAN_QUALITY
20 The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.
--OPTICAL_DUPLICATE_PIXEL_DISTANCE
100 The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.
--READ_NAME_REGEX
MarkDuplicates can use the tile and cluster positions to estimate the rate of optical duplication in addition to the dominant source of duplication, PCR, to provide a more accurate estimation of library size. By default (with no READ_NAME_REGEX specified), MarkDuplicates will attempt to extract coordinates using a split on ':' (see the note below). Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without optical duplicate counts, library size estimation will be less accurate. If the read name does not follow the standard Illumina colon-separated convention but does contain tile and x,y coordinates, a regular expression can be specified to extract three variables from the read name: tile/region, x coordinate, and y coordinate. The regular expression must contain three capture groups for the three variables, in order, and it must match the entire read name. For example, if fields were separated by semicolons (';'), this regex could be specified: (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no READ_NAME_REGEX is specified, the read name is split on ':'. For 5-element names, the 3rd, 4th, and 5th elements are assumed to be the tile, x, and y values. For 7-element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be the tile, x, and y values.
--READ_ONE_BARCODE_TAG
null Read one barcode SAM tag (ex. BX for 10X Genomics)
--READ_TWO_BARCODE_TAG
null Read two barcode SAM tag (ex. BX for 10X Genomics)
--version
false display the version number for this tool
Optional Common Arguments
--COMPRESSION_LEVEL
5 Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX
false Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE
false Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS
client_secrets.json Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM
4052935 When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET
false Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE
 -R
null Reference sequence file.
--TMP_DIR
[] One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER
 -use_jdk_deflater
false Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER
 -use_jdk_inflater
false Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY
STRICT Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY
INFO Control verbosity of logging.
Advanced Arguments
--showHidden
false display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--arguments_file / NA

read one or more arguments files and add them to the command line

List[File]  []


--BARCODE_TAG / NA

Barcode SAM tag (ex. BC for 10X Genomics)

String  null


--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean  false


--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String  client_secrets.json


--help / -h

display the help message

boolean  false


--INPUT / -I

One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped.

R List[File]  []


--MAX_DIFF_RATE / NA

The maximum rate of differences between two reads to call them identical.

double  0.03  [ [ -∞  ∞ ] ]


--MAX_GROUP_RATIO / NA

Do not process self-similar groups that are this many times over the mean expected group size. For example, if the input contains 10m read pairs and MIN_IDENTICAL_BASES is set to 5, then the mean expected group size would be approximately 10 reads.

int  500  [ [ -∞  ∞ ] ]
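
A back-of-the-envelope sketch of the arithmetic behind the example above. It assumes the grouping key covers the first MIN_IDENTICAL_BASES bases of both reads of a pair, as described in the tool overview; the class name is invented:

// Hypothetical calculation only; numbers follow the example in the argument description.
class ExpectedGroupSizeSketch {
    public static void main(String[] args) {
        long readPairs = 10_000_000L;             // "10m read pairs"
        int minIdenticalBases = 5;                // MIN_IDENTICAL_BASES
        // Assumption: the key uses the first N bases of read 1 and of read 2,
        // giving 4^(2N) possible keys (~1,048,576 for N = 5).
        double possibleKeys = Math.pow(4, 2 * minIdenticalBases);
        double meanGroupSize = readPairs / possibleKeys;        // ~9.5, i.e. roughly 10 reads
        double maxGroupRatio = 500;                             // default MAX_GROUP_RATIO
        double skipThreshold = maxGroupRatio * meanGroupSize;   // groups larger than this are not processed
        System.out.printf("mean group size ~%.1f, skip groups larger than ~%.0f%n",
                meanGroupSize, skipThreshold);
    }
}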


--MAX_OPTICAL_DUPLICATE_SET_SIZE / NA

This number is the maximum size of a set of duplicate reads for which we will attempt to determine which are optical duplicates. Please be aware that if you raise this value too high and do encounter a very large set of duplicate reads, it will severely affect the runtime of this tool. To completely disable this check, set the value to -1.

long  300000  [ [ -∞  ∞ ] ]


--MAX_READ_LENGTH / NA

The maximum number of bases to consider when comparing reads (0 means no maximum).

int  0  [ [ -∞  ∞ ] ]


--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  4052935  [ [ -∞  ∞ ] ]


--MIN_GROUP_COUNT / NA

Minimum group count. On a per-library basis, we count the number of groups of duplicates that have a particular size. Any count less than this value is omitted from consideration. For example, if we see only one group of duplicates with size 500, we omit it from the metric calculations if MIN_GROUP_COUNT is set to two. Setting this to two may help remove technical artifacts, such as adapter dimers, from the library size calculation.

int  2  [ [ -∞  ∞ ] ]


--MIN_IDENTICAL_BASES / NA

The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU.

int  5  [ [ -∞  ∞ ] ]


--MIN_MEAN_QUALITY / NA

The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations.

int  20  [ [ -∞  ∞ ] ]


--OPTICAL_DUPLICATE_PIXEL_DISTANCE / NA

The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.

int  100  [ [ -∞  ∞ ] ]
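
As a rough illustration of this check, here is a sketch written under the assumption that the offset is compared per axis on the same tile; the class and field names are invented and this is not Picard's optical duplicate finder:

// Hypothetical sketch: two clusters on the same tile whose x and y offsets are both
// within OPTICAL_DUPLICATE_PIXEL_DISTANCE are treated as optical duplicates.
class OpticalDuplicateSketch {
    static final int OPTICAL_DUPLICATE_PIXEL_DISTANCE = 100;  // default; 2500 suits patterned flowcells

    static class ClusterLocation {
        final int tile, x, y;
        ClusterLocation(int tile, int x, int y) { this.tile = tile; this.x = x; this.y = y; }
    }

    static boolean isOpticalDuplicate(ClusterLocation a, ClusterLocation b) {
        return a.tile == b.tile
            && Math.abs(a.x - b.x) <= OPTICAL_DUPLICATE_PIXEL_DISTANCE
            && Math.abs(a.y - b.y) <= OPTICAL_DUPLICATE_PIXEL_DISTANCE;
    }
}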


--OUTPUT / -O

Output file to write per-library metrics to.

R File  null


--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean  false


--READ_NAME_REGEX / NA

MarkDuplicates can use the tile and cluster positions to estimate the rate of optical duplication in addition to the dominant source of duplication, PCR, to provide a more accurate estimation of library size. By default (with no READ_NAME_REGEX specified), MarkDuplicates will attempt to extract coordinates using a split on ':' (see the note below). Set READ_NAME_REGEX to 'null' to disable optical duplicate detection. Note that without optical duplicate counts, library size estimation will be less accurate. If the read name does not follow the standard Illumina colon-separated convention but does contain tile and x,y coordinates, a regular expression can be specified to extract three variables from the read name: tile/region, x coordinate, and y coordinate. The regular expression must contain three capture groups for the three variables, in order, and it must match the entire read name. For example, if fields were separated by semicolons (';'), this regex could be specified: (?:.*;)?([0-9]+)[^;]*;([0-9]+)[^;]*;([0-9]+)[^;]*$ Note that if no READ_NAME_REGEX is specified, the read name is split on ':'. For 5-element names, the 3rd, 4th, and 5th elements are assumed to be the tile, x, and y values. For 7-element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be the tile, x, and y values.

String  
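
Restating those parsing rules as a small sketch (a hypothetical helper, not Picard's actual read-name parser):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the coordinate extraction described above.
// Returns {tile, x, y} or null if the read name cannot be parsed.
class ReadNameCoordinateSketch {

    // Default behaviour: split on ':' and pick tile/x/y by element count.
    static int[] parseByColonSplit(String readName) {
        String[] fields = readName.split(":");
        if (fields.length == 5) {        // 3rd, 4th and 5th elements are tile, x, y
            return new int[] { Integer.parseInt(fields[2]), Integer.parseInt(fields[3]), Integer.parseInt(fields[4]) };
        }
        if (fields.length == 7) {        // CASAVA 1.8: 5th, 6th and 7th elements are tile, x, y
            return new int[] { Integer.parseInt(fields[4]), Integer.parseInt(fields[5]), Integer.parseInt(fields[6]) };
        }
        return null;
    }

    // Custom READ_NAME_REGEX: three capture groups (tile, x, y) that match the entire name,
    // e.g. the semicolon-separated example regex from the description above.
    static int[] parseByRegex(String readName, String readNameRegex) {
        Matcher m = Pattern.compile(readNameRegex).matcher(readName);
        if (!m.matches()) return null;
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3)) };
    }
}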


--READ_ONE_BARCODE_TAG / NA

Read one barcode SAM tag (ex. BX for 10X Genomics)

String  null


--READ_TWO_BARCODE_TAG / NA

Read two barcode SAM tag (ex. BX for 10X Genomics)

String  null


--REFERENCE_SEQUENCE / -R

Reference sequence file.

File  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version / NA

display the version number for this tool

boolean  false




See also General Documentation | Tool Documentation Index | Support Forum

GATK version 4.0.11.0 built at 23-11-2018 02:11:49.