IntervalListTools (Picard)

A tool for performing various IntervalList manipulations

Summary

This tool offers multiple interval list file manipulation capabilities, including: sorting, merging, subtracting, padding, and other set-theoretic operations. The default action is to merge and sort the intervals provided in the INPUTs. Other options, e.g. interval subtraction, are controlled by the arguments.
Both IntervalList and VCF files are accepted as input. An IntervalList file should have the extension .interval_list, while a VCF file must have one of .vcf, .vcf.gz, or .bcf. When a VCF file is used as input, each variant is translated into an interval, using its reference allele or the END INFO annotation (if present) to determine the extent of the interval. IntervalListTools can also "scatter" the resulting interval list into many interval files, which is useful for running an analysis in parallel over multiple interval lists.
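
The variant-to-interval translation described above can be illustrated with a short Python sketch. It is not Picard's code: the function name variant_to_interval and its parameters are hypothetical, and it assumes only what the paragraph states, namely that the interval starts at the variant position and extends over the reference allele unless an END value is present.

```python
from typing import Optional, Tuple


def variant_to_interval(chrom: str, pos: int, ref: str,
                        end: Optional[int] = None) -> Tuple[str, int, int]:
    """Return (contig, start, end) in 1-based, closed coordinates.

    Hypothetical helper: `pos` is the 1-based variant position, `ref` the
    reference allele, and `end` the value of the END INFO annotation if any.
    """
    if end is not None:
        # END, when present, determines where the interval stops.
        return (chrom, pos, end)
    # Otherwise the interval covers exactly the bases of the reference allele.
    return (chrom, pos, pos + len(ref) - 1)


snp = variant_to_interval("chr1", 100, "A")
deletion = variant_to_interval("chr1", 200, "ACGT")
with_end = variant_to_interval("chr2", 50, "N", end=150)
```

A single-base reference allele yields a one-base interval, while a four-base reference allele at position 200 yields the interval 200 to 203.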

Details

The IntervalList file format is designed to help users avoid mixing references when supplying intervals and other genomic data to a single tool. A SAM-style header must be present at the top of the file. After the header, the file contains records, one per line in text format, with the following tab-separated values:

- Sequence name (SN)
- Start position (1-based)
- End position (1-based, inclusive)
- Strand (either + or -)
- Interval name (ideally unique names for intervals)

The coordinate system is 1-based and closed-ended, so that the first base in a sequence has position 1, and both the start and the end positions are included in an interval.

Example interval list file:

@HD	VN:1.0
@SQ	SN:chr1	LN:501
@SQ	SN:chr2	LN:401
chr1	1	100	+	starts at the first base of the contig and covers 100 bases
chr2	100	100	+	interval with exactly one base
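
To make the record layout and the 1-based, closed coordinate convention concrete, here is a minimal Python sketch that reads the example file above and computes the number of bases in each interval. The Interval type and parse_interval_list function are hypothetical names for illustration; this is not how Picard or htsjdk parse the format, and the header lines are simply skipped rather than validated.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str
    name: str

    def length(self) -> int:
        # Both ends are included, hence the +1.
        return self.end - self.start + 1


def parse_interval_list(text: str) -> List[Interval]:
    intervals = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM-style header lines
        contig, start, end, strand, name = line.split("\t", 4)
        intervals.append(Interval(contig, int(start), int(end), strand, name))
    return intervals


EXAMPLE = (
    "@HD\tVN:1.0\n"
    "@SQ\tSN:chr1\tLN:501\n"
    "@SQ\tSN:chr2\tLN:401\n"
    "chr1\t1\t100\t+\tstarts at the first base of the contig and covers 100 bases\n"
    "chr2\t100\t100\t+\tinterval with exactly one base\n"
)

records = parse_interval_list(EXAMPLE)
lengths = [record.length() for record in records]
```

The two records have lengths 100 and 1, matching their interval names.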

Usage Examples

1. Combine the intervals from two interval lists:

java -jar picard.jar IntervalListTools \
      ACTION=CONCAT \
      I=input.interval_list \
      I=input_2.interval_list \
      O=new.interval_list

2. Combine the intervals from two interval lists, sorting the resulting list and merging overlapping and abutting intervals:

 java -jar picard.jar IntervalListTools \
       ACTION=CONCAT \
       SORT=true \
       UNIQUE=true \
       I=input.interval_list \
       I=input_2.interval_list \
       O=new.interval_list
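
What "sorting and merging overlapping and abutting intervals" means can be sketched in a few lines of Python. This is an illustration of the concept, not Picard's implementation; sort_and_merge and the contig_order parameter (standing in for the sequence order of the header) are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based and closed


def sort_and_merge(intervals: List[Interval],
                   contig_order: List[str]) -> List[Interval]:
    # Sort by contig order first, then by start and end position.
    rank = {name: i for i, name in enumerate(contig_order)}
    ordered = sorted(intervals, key=lambda iv: (rank[iv[0]], iv[1], iv[2]))

    merged: List[Interval] = []
    for contig, start, end in ordered:
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + 1:
            # Overlapping (start <= previous end) or abutting
            # (start == previous end + 1): extend the previous interval.
            previous = merged[-1]
            merged[-1] = (contig, previous[1], max(previous[2], end))
        else:
            merged.append((contig, start, end))
    return merged


combined = sort_and_merge(
    [("chr2", 10, 20), ("chr1", 50, 100), ("chr1", 101, 150), ("chr1", 1, 60)],
    contig_order=["chr1", "chr2"],
)
```

Here chr1:1-60 and chr1:50-100 overlap, and chr1:101-150 abuts the result, so the three collapse into chr1:1-150.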

3. Subtract the intervals in SECOND_INPUT from those in INPUT:

 java -jar picard.jar IntervalListTools \
       ACTION=SUBTRACT \
       I=input.interval_list \
       SI=input_2.interval_list \
       O=new.interval_list
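
The effect of a subtraction on individual intervals can be sketched as follows. This is a conceptual Python illustration with a hypothetical subtract function, not the code Picard runs; it assumes the intervals within the first list do not overlap one another.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based and closed


def subtract(first: List[Interval], second: List[Interval]) -> List[Interval]:
    result: List[Interval] = []
    for contig, start, end in first:
        # Pieces of `second` on the same contig, in coordinate order.
        cuts = sorted((s, e) for c, s, e in second if c == contig)
        cursor = start  # first base not yet emitted or removed
        for cut_start, cut_end in cuts:
            if cut_end < cursor or cut_start > end:
                continue  # this cut does not touch the remaining part
            if cut_start > cursor:
                # Keep the bases that precede the cut.
                result.append((contig, cursor, cut_start - 1))
            cursor = max(cursor, cut_end + 1)
        if cursor <= end:
            result.append((contig, cursor, end))
    return result


remaining = subtract(
    [("chr1", 1, 100), ("chr2", 5, 10)],
    [("chr1", 20, 30), ("chr1", 90, 120)],
)
```

Removing chr1:20-30 and chr1:90-120 from chr1:1-100 leaves chr1:1-19 and chr1:31-89, while chr2:5-10 is untouched.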

4. Find bases that are in either input1.interval_list or input2.interval_list, and also in input3.interval_list:

 java -jar picard.jar IntervalListTools \
       ACTION=INTERSECT \
       I=input1.interval_list \
       I=input2.interval_list \
       SI=input3.interval_list \
       O=new.interval_list
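
The set logic stated in example 4 ("in either input1 or input2, and also in input3") can be checked on toy data by expanding intervals into sets of bases. This Python sketch only illustrates the described result on small inputs; the bases helper and the sample intervals are hypothetical, and enumerating every base would be far too slow for real genomic data.

```python
from typing import List, Set, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based and closed


def bases(intervals: List[Interval]) -> Set[Tuple[str, int]]:
    # Expand each closed interval into its individual (contig, position) bases.
    return {(contig, pos)
            for contig, start, end in intervals
            for pos in range(start, end + 1)}


input1 = [("chr1", 1, 10)]
input2 = [("chr1", 20, 30)]
input3 = [("chr1", 5, 25)]

# (input1 OR input2) AND input3
shared = (bases(input1) | bases(input2)) & bases(input3)
positions = sorted(pos for _, pos in shared)
```

With these inputs the shared bases are chr1:5-10 and chr1:20-25.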

Category Intervals Manipulation

Overview

Performs various IntervalList manipulations.

IntervalListTools (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s)	Default value	Summary
Required Arguments
--INPUT -I	[]	One or more interval lists. If multiple interval lists are provided the output is the result of merging the inputs. Supported formats are interval_list and VCF.
Optional Tool Arguments
--ACTION	CONCAT	Action to take on inputs.
--arguments_file	[]	read one or more arguments files and add them to the command line
--BREAK_BANDS_AT_MULTIPLES_OF -BRK	0	If set to a positive value will create a new interval list with the original intervals broken up at integer multiples of this value. Set to 0 to NOT break up intervals.
--COMMENT	[]	One or more lines of comment to add to the header of the output file (as @CO lines in the SAM header).
--help -h	false	display the help message
--INCLUDE_FILTERED	false	Whether to include filtered variants in the vcf when generating an interval list from vcf.
--INVERT	false	Produce the inverse list of intervals, that is, the regions in the genome that are not covered by any of the input intervals. Will merge abutting intervals first. Output will be sorted.
--OUTPUT -O	null	The output interval list file to write (if SCATTER_COUNT == 1) or the directory into which to write the scattered interval sub-directories (if SCATTER_COUNT > 1).
--PADDING	0	The amount to pad each end of the intervals by before other operations are undertaken. Negative numbers are allowed and indicate intervals should be shrunk. Resulting intervals < 0 bases long will be removed. Padding is applied to the interval lists (both INPUT and SECOND_INPUT, if provided) before the ACTION is performed.
--SCATTER_COUNT	1	The number of files into which to scatter the resulting list by locus; in some situations, fewer intervals may be emitted. Note - if > 1, the resultant scattered intervals will be sorted and uniqued. The sort will be inverted if the INVERT flag is set.
--SECOND_INPUT -SI	[]	Second set of intervals for SUBTRACT and DIFFERENCE operations.
--SORT	true	If true, sort the resulting interval list by coordinate.
--SUBDIVISION_MODE -M	INTERVAL_SUBDIVISION	Selects between various ways in which scattering of the interval-list can happen.
--UNIQUE	false	If true, merge overlapping and adjacent intervals to create a list of unique intervals. Implies SORT=true.
--version	false	display the version number for this tool
Optional Common Arguments
--COMPRESSION_LEVEL	5	Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX	false	Whether to create a BAM index when writing a coordinate-sorted BAM file.
--CREATE_MD5_FILE	false	Whether to create an MD5 digest for any BAM or FASTQ files created.
--GA4GH_CLIENT_SECRETS	client_secrets.json	Google Genomics API client_secrets.json file path.
--MAX_RECORDS_IN_RAM	500000	When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET	false	Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE -R	null	Reference sequence file.
--TMP_DIR	[]	One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER -use_jdk_deflater	false	Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER -use_jdk_inflater	false	Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY	STRICT	Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY	INFO	Control verbosity of logging.
Advanced Arguments
--showHidden	false	display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.

--ACTION / NA

Action to take on inputs.

The --ACTION argument is an enumerated type (Action), which can have one of the following values:

CONCAT
UNION
INTERSECT
SUBTRACT
SYMDIFF
OVERLAPS

Action CONCAT

--arguments_file / NA

read one or more arguments files and add them to the command line

List[File] []

--BREAK_BANDS_AT_MULTIPLES_OF / -BRK

If set to a positive value will create a new interval list with the original intervals broken up at integer multiples of this value. Set to 0 to NOT break up intervals.

int 0 [ [ -∞ ∞ ] ]
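
As an illustration of breaking an interval at integer multiples of a value, consider the Python sketch below. The function break_at_multiples is hypothetical, and the boundary convention (a new piece starts at every position that is an exact multiple of the value) is an assumption of this sketch; the tool's exact convention may differ.

```python
from typing import List, Tuple


def break_at_multiples(start: int, end: int, band: int) -> List[Tuple[int, int]]:
    """Split the closed interval [start, end] at multiples of `band`."""
    if band <= 0:
        return [(start, end)]  # 0 means: do not break up the interval
    pieces = []
    piece_start = start
    while piece_start <= end:
        # Last position before the next multiple of `band`, capped at `end`.
        piece_end = min(end, (piece_start // band + 1) * band - 1)
        pieces.append((piece_start, piece_end))
        piece_start = piece_end + 1
    return pieces


pieces = break_at_multiples(1, 250, 100)
unbroken = break_at_multiples(1, 250, 0)
```

Under the stated assumption, the interval 1-250 with a value of 100 becomes 1-99, 100-199, and 200-250.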

--COMMENT / NA

One or more lines of comment to add to the header of the output file (as @CO lines in the SAM header).

List[String] []

--COMPRESSION_LEVEL / NA

Compression level for all compressed files created (e.g. BAM and VCF).

int 5 [ [ -∞ ∞ ] ]

--CREATE_INDEX / NA

Whether to create a BAM index when writing a coordinate-sorted BAM file.

Boolean false

--CREATE_MD5_FILE / NA

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean false

--GA4GH_CLIENT_SECRETS / NA

Google Genomics API client_secrets.json file path.

String client_secrets.json

--help / -h

display the help message

boolean false

--INCLUDE_FILTERED / NA

Whether to include filtered variants in the vcf when generating an interval list from vcf.

boolean false

--INPUT / -I

One or more interval lists. If multiple interval lists are provided the output is the result of merging the inputs. Supported formats are interval_list and VCF.

R List[File] []

--INVERT / NA

Produce the inverse list of intervals, that is, the regions in the genome that are not covered by any of the input intervals. Will merge abutting intervals first. Output will be sorted.

boolean false
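
The inverse of an interval list depends on the contig lengths declared in the header. The Python sketch below illustrates the idea with a hypothetical invert function and a contig_lengths mapping standing in for the @SQ lines; it is not Picard's implementation.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based and closed


def invert(intervals: List[Interval],
           contig_lengths: Dict[str, int]) -> List[Interval]:
    result: List[Interval] = []
    for contig, length in contig_lengths.items():
        cursor = 1  # first base of the contig not yet known to be covered
        for _, start, end in sorted(iv for iv in intervals if iv[0] == contig):
            if start > cursor:
                # Gap before this interval: it belongs to the inverse.
                result.append((contig, cursor, start - 1))
            cursor = max(cursor, end + 1)
        if cursor <= length:
            # Uncovered tail of the contig.
            result.append((contig, cursor, length))
    return result


inverse = invert(
    [("chr1", 1, 100), ("chr2", 100, 100)],
    {"chr1": 501, "chr2": 401},
)
```

For the example interval list shown earlier on this page (chr1 of length 501, chr2 of length 401), the inverse is chr1:101-501, chr2:1-99, and chr2:101-401.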

--MAX_RECORDS_IN_RAM / NA

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer 500000 [ [ -∞ ∞ ] ]

--OUTPUT / -O

The output interval list file to write (if SCATTER_COUNT == 1) or the directory into which to write the scattered interval sub-directories (if SCATTER_COUNT > 1).

File null

--PADDING / NA

The amount to pad each end of the intervals by before other operations are undertaken. Negative numbers are allowed and indicate intervals should be shrunk. Resulting intervals < 0 bases long will be removed. Padding is applied to the interval lists (both INPUT and SECOND_INPUT, if provided) before the ACTION is performed.

int 0 [ [ -∞ ∞ ] ]
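
Padding, including negative padding that shrinks intervals, can be sketched as below. The function pad_intervals is hypothetical. Two details are assumptions of this sketch rather than statements from this page: padded intervals are clamped to the contig bounds, and an interval is dropped once it has no bases left (end before start).

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based and closed


def pad_intervals(intervals: List[Interval], padding: int,
                  contig_lengths: Dict[str, int]) -> List[Interval]:
    padded: List[Interval] = []
    for contig, start, end in intervals:
        # Assumption: never extend past the first or last base of the contig.
        new_start = max(1, start - padding)
        new_end = min(contig_lengths[contig], end + padding)
        # Assumption: intervals shrunk to nothing are removed.
        if new_start <= new_end:
            padded.append((contig, new_start, new_end))
    return padded


lengths = {"chr1": 501, "chr2": 401}
source = [("chr1", 1, 100), ("chr2", 100, 100)]

grown = pad_intervals(source, 10, lengths)
shrunk = pad_intervals(source, -1, lengths)
```

With a padding of 10, chr1:1-100 becomes chr1:1-110 (the start cannot go below 1) and chr2:100-100 becomes chr2:90-110. With a padding of -1, the single-base interval on chr2 disappears.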

--QUIET / NA

Whether to suppress job-summary info on System.err.

Boolean false

--REFERENCE_SEQUENCE / -R

Reference sequence file.

File null

--SCATTER_COUNT / NA

The number of files into which to scatter the resulting list by locus; in some situations, fewer intervals may be emitted. Note - if > 1, the resultant scattered intervals will be sorted and uniqued. The sort will be inverted if the INVERT flag is set.

int 1 [ [ -∞ ∞ ] ]

--SECOND_INPUT / -SI

Second set of intervals for SUBTRACT and DIFFERENCE operations.

List[File] []

--showHidden / -showHidden

display hidden arguments

boolean false

--SORT / NA

If true, sort the resulting interval list by coordinate.

boolean true

--SUBDIVISION_MODE / -M

Selects between various ways in which scattering of the interval-list can happen.

The --SUBDIVISION_MODE argument is an enumerated type (Mode), which can have one of the following values:

INTERVAL_SUBDIVISION

A simple scatter approach in which all output intervals have size equal to the total base count of the source list divided by the scatter count (except, possibly, in the last interval list).

BALANCING_WITHOUT_INTERVAL_SUBDIVISION

A scatter approach that differs from INTERVAL_SUBDIVISION in a few ways:

- No interval will be subdivided, and consequently the requested SCATTER_COUNT is an upper bound on the scatter count, not a guarantee of the number of interval lists that will be produced (e.g., if SCATTER_COUNT = 10 but there is only one interval in the input, only 1 interval list will be emitted).
- When an interval would otherwise be split, it is instead deferred to the next scatter list.
- The "target width" of each scatter list may be wider than what is computed for INTERVAL_SUBDIVISION. Specifically, if the widest interval in the source interval list is larger than what would otherwise be the target width, that interval's width is used.

This approach produces more consistently-sized interval lists, which is one of the objectives of scattering.

BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW

A scatter approach that differs from BALANCING_WITHOUT_INTERVAL_SUBDIVISION. This approach tries to balance the number of bases in each interval list by estimating the sizes of the remaining interval lists. The estimate is computed from the total number of unique bases and the number of bases already consumed. As a result, the interval list with the most unique bases is larger than the smallest interval list by at most the ideal split length (measured in unique bases).

Mode INTERVAL_SUBDIVISION
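
The default INTERVAL_SUBDIVISION mode, as described above, gives every output list the same number of bases (total bases divided by the scatter count), splitting intervals where necessary. The Python sketch below illustrates that idea only; scatter_by_subdivision is a hypothetical name, rounding the target size up is an assumption, and the input is assumed to be already sorted and merged.

```python
import math
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based and closed


def scatter_by_subdivision(intervals: List[Interval],
                           scatter_count: int) -> List[List[Interval]]:
    total = sum(end - start + 1 for _, start, end in intervals)
    target = math.ceil(total / scatter_count)  # assumed rounding: up

    lists: List[List[Interval]] = []
    current: List[Interval] = []
    room = target  # bases still missing from the current list
    for contig, start, end in intervals:
        while start <= end:
            # Take as much of the interval as still fits into the current list.
            take_end = min(end, start + room - 1)
            current.append((contig, start, take_end))
            room -= take_end - start + 1
            start = take_end + 1
            if room == 0:
                lists.append(current)
                current, room = [], target
    if current:
        lists.append(current)  # the last list may be smaller
    return lists


scattered = scatter_by_subdivision([("chr1", 1, 100), ("chr2", 1, 50)], 4)
sizes = [sum(e - s + 1 for _, s, e in part) for part in scattered]
```

With 150 bases and a scatter count of 4, the target size is 38, so the lists hold 38, 38, 38, and 36 bases, and the third list spans the end of chr1 and the start of chr2.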

--TMP_DIR / NA

One or more directories with space available to be used by this program for temporary storage of working files

List[File] []

--UNIQUE / NA

If true, merge overlapping and adjacent intervals to create a list of unique intervals. Implies SORT=true.

boolean false

--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean false

--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean false

--VALIDATION_STRINGENCY / NA

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency STRICT

--VERBOSITY / NA

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel INFO

--version / NA

display the version number for this tool

boolean false

GATK version 4.0.11.0 built at 23-11-2018 02:11:49.