Raw contact to .hic → juicer_tools
Software: https://github.com/aidenlab/juicer/wiki/Download
Documentation: https://github.com/aidenlab/juicer/wiki/Pre
Introduction
Installation
wget https://s3.amazonaws.com/hicfiles.tc4ga.com/public/juicer/juicer_tools_1.22.01.jar
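Usage
The pre command converts a text file (<infile>) into a .hic file (<outfile>) that stores the contact matrix at several resolutions.
For details of the .hic format, see: https://www.cell.com/cell-systems/fulltext/S2405-4712(16)30219-8
The default resolutions are 2.5M, 1M, 500K, 250K, 100K, 50K, 25K, 10K, and 5K; a different set can be specified with the -r option.
Unless the -n option is used, the output .hic file includes the VC, VC_SQRT, KR, and SCALE normalization results by default.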
Usage: juicer_tools pre [options] <infile> <outfile> <genomeID>
: -d only calculate intra chromosome (diagonal) [false]
: -f <restriction site file> calculate fragment map
: -m <int> only write cells with count above threshold m [0]
: -q <int> filter by MAPQ score greater than or equal to q [not set]
: -c <chromosome ID> only calculate map on specific chromosome [not set]
: -r <comma-separated list of resolutions> Only calculate specific resolutions [not set]
: -t <tmpDir> Set a temporary directory for writing
: -s <statistics file> Add the text statistics file to the Hi-C file header
: -g <graphs file> Add the text graphs file to the Hi-C file header
: -n Don't normalize the matrices
: -z <double> scale factor for hic file
: -a <1, 2, 3, 4, 5> filter based on inner, outer, left-left, right-right, tandem pairs respectively
: --randomize_position randomize positions between fragment sites
: --random_seed <long> for seeding random number generator
: --frag_site_maps <fragment site files> for randomization
: -k normalizations to include
: -j number of CPU threads to use
: --threads <int> number of threads
: --mndindex <filepath> to mnd chr block indices
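As an illustration of how the options above combine, the following sketch assembles a pre command that restricts the output to three resolutions (-r) and skips normalization (-n). The file names and the jar location are placeholders, not values from any real run, and the command is only printed so that it can be inspected first.

```shell
# Placeholder paths (assumptions); replace them with your own files.
juicer_tools_jar="$HOME/Softwares/juicer_tools_1.22.01.jar"
infile="contacts.short.txt"
outfile="contacts.hic"
genomeID="mm10"

# -r takes a comma-separated list of resolutions in bp;
# -n tells pre not to normalize the matrices.
resolutions="1000000,500000,100000"
cmd="java -Xmx2g -jar ${juicer_tools_jar} pre -r ${resolutions} -n ${infile} ${outfile} ${genomeID}"

# Print the command for inspection; run it afterwards with: eval "$cmd"
echo "$cmd"
```

Options are placed before the positional arguments here, following the order shown in the usage line.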
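Input format
short format
- 8 columns: <str1> <chr1> <pos1> <frag1> <str2> <chr2> <pos2> <frag2>
Notes:
- str: strand (0 for forward, anything else for reverse; strand information is currently not stored in the .hic file)
- frag: restriction site fragment (juicer_tools pre automatically discards reads that map to the same restriction fragment, so when no fragment information is available, setting frag1 to 0 and frag2 to 1 is recommended)
In addition, the data must satisfy:
- chr1 <= chr2
- sorted by chr1, then chr2 (i.e. all chr3-chr3 reads must be contiguous)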
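The two ordering requirements can be enforced mechanically. The sketch below is one possible way to do it, not part of juicer_tools: the function name normalize_short_format is hypothetical, the input is assumed to be tab-separated, and chromosome names are assumed to be compared as plain byte strings (LC_ALL=C), so that the awk comparison and the sort agree with each other.

```shell
# Sketch: make a short-format file satisfy chr1 <= chr2 and group it by (chr1, chr2).
normalize_short_format() {
    # $1: tab-separated short-format file; the result is written to stdout
    LC_ALL=C awk 'BEGIN { FS = OFS = "\t" }
        # Lines that do not have exactly 8 columns are reported and skipped.
        NF != 8 { print "skipping line " NR ": expected 8 columns" > "/dev/stderr"; next }
        # Appending "" forces a string comparison; swap both ends if chr1 sorts after chr2.
        ($2 "") > ($6 "") { print $5, $6, $7, $8, $1, $2, $3, $4; next }
        { print }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k6,6
}
```

Usage: normalize_short_format contacts.short.txt > contacts.sorted.txt (both file names are placeholders). Whether byte order is the ordering that a given juicer_tools version expects is an assumption worth checking on a small file first.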
Example
Raw data format: <seqID> <chr1> <pos1> <chr2> <pos2>
Step 1. Re-organization of raw data
Convert the raw data to short format and sort by chromosome
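# Raw columns: $1=seqID (dropped), $2=chr1, $3=pos1, $4=chr2, $5=pos2
# frag1=0 and frag2=1 are dummy fragments so that no read is discarded
# Note: this does not swap ends, so the input must already have chr1 <= chr2
cat "${raw_contact_file}" | \
awk 'BEGIN{OFS="\t"}{print 0, $2, $3, 0, 1, $4, $5, 1}' | \
sort -k2,2d -k6,6d \
> "${short_format_contact_file}"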
Step 2. From short-format txt to .hic
juicer_tools=~/Softwares/juicer_tools_1.22.01.jar
infile=${short_format_contact_file}
outfile=${hic_file}
genomeID=mm10
java -Xmx2g -jar ${juicer_tools} pre ${infile} ${outfile} ${genomeID} --threads 4
Trouble-shooting Tips
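Error: the chromosome combination 1_1 appears in multiple blocks
Cause: the reads are not sorted by chromosome
Solution: sort -k2,2d -k6,6d (adjust the keys to the columns that actually hold the chromosomes)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Cause: the Java heap is too small
Solution: increase the -Xmx value (for example, -Xmx16g)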
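The "multiple blocks" error can be caught before running pre. The sketch below scans a short-format file and reports every chromosome pair that shows up in more than one contiguous run. The function name check_pair_blocks is hypothetical, and the file is assumed to be whitespace-separated with chr1 in column 2 and chr2 in column 6.

```shell
# Sketch: report chromosome pairs that occur in more than one contiguous block.
check_pair_blocks() {
    # $1: short-format file; exit status is non-zero if any pair is split
    awk '{
        pair = $2 "_" $6
        if (pair != prev) {
            # A pair that starts a new run but was seen before is split.
            if (pair in seen) { print "split block: " pair; bad = 1 }
            seen[pair] = 1
            prev = pair
        }
    }
    END { exit bad }' "$1"
}
```

Usage: check_pair_blocks contacts.short.txt || echo "re-sort the file before running pre" (the file name is a placeholder).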
Raw contact to .cool → cooler
Software: https://github.com/open2c/cooler
Documentation: https://cooler.readthedocs.io/en/latest/cli.html#cooler-cload-pairs
Installation:
pip install cooler
Usage
The cooler cload pairs command bins a contact file into a .cool file at the resolution given in BINS.
Usage: cooler cload pairs [OPTIONS] BINS PAIRS_PATH COOL_PATH
Bin any text file or stream of pairs.
Pairs data need not be sorted. Accepts compressed files. To pipe input from
stdin, set PAIRS_PATH to '-'.
BINS : One of the following
<TEXT:INTEGER> : 1. Path to a chromsizes file, 2. Bin size in bp
<TEXT> : Path to BED file defining the genomic bin segmentation.
PAIRS_PATH : Path to contacts (i.e. read pairs) file.
COOL_PATH : Output COOL file path or URI.
Options:
--metadata TEXT Path to JSON file containing user metadata.
--assembly TEXT Name of genome assembly (e.g. hg19, mm10)
-c1, --chrom1 INTEGER chrom1 field number (one-based) [required]
-p1, --pos1 INTEGER pos1 field number (one-based) [required]
-c2, --chrom2 INTEGER chrom2 field number (one-based) [required]
-p2, --pos2 INTEGER pos2 field number (one-based) [required]
--chunksize INTEGER Number of input lines to load at a time
-0, --zero-based Positions are zero-based [default: False]
--comment-char TEXT Comment character that indicates lines to
ignore. [default: #]
-N, --no-symmetric-upper Create a complete square matrix without
implicit symmetry. This allows for distinct
upper- and lower-triangle values
--input-copy-status [unique|duplex]
Copy status of input data when using
symmetric-upper storage. | `unique`:
Incoming data comes from a unique half of a
symmetric map, regardless of how the
coordinates of a pair are ordered. `duplex`:
Incoming data contains upper- and lower-
triangle duplicates. All input records that
map to the lower triangle will be discarded!
| If you wish to treat lower- and upper-
triangle input data as distinct, use the
``--no-symmetric-upper`` option. [default:
unique]
--field TEXT Specify quantitative input fields to
aggregate into value columns using the
syntax ``--field <field-name>=<field-
number>``. Optionally, append ``:`` followed
by ``dtype=<dtype>`` to specify the data
type (e.g. float), and/or ``agg=<agg>`` to
specify an aggregation function different
from sum (e.g. mean). Field numbers are
1-based. Passing 'count' as the target name
will override the default behavior of
storing pair counts. Repeat the ``--field``
option for each additional field.
--temp-dir DIRECTORY Create temporary files in a specified
directory. Pass ``-`` to use the platform
default temp dir.
--no-delete-temp Do not delete temporary files when finished.
--max-merge INTEGER Maximum number of chunks to merge before
invoking recursive merging [default: 200]
--storage-options TEXT Options to modify the data filter pipeline.
Provide as a comma-separated list of key-
value pairs of the form 'k1=v1,k2=v2,...'.
See http://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline
for more details.
-h, --help Show this message and exit.
Example
contact -> 1kb .cool file
cooler cload pairs -c1 1 -p1 2 -c2 3 -p2 4 \
mm10.chrom.sizes:1000 \
129G1_chr19.contact.bedpe \
129G1_chr19.1000.cool
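To make the chromsizes:binsize form of BINS more concrete, the sketch below mimics fixed-size binning with awk. It is an illustration only and not how cooler is implemented: it ignores chromosome sizes and symmetric-upper storage, the function name bin_contacts is hypothetical, and the input is assumed to have the columns chr1 pos1 chr2 pos2 with one-based positions (the default when -0 is not given).

```shell
# Illustration only: count contacts per pair of fixed-size bins.
bin_contacts() {
    # $1: pairs file (chr1 pos1 chr2 pos2), $2: bin size in bp
    awk -v size="$2" '{
        b1 = int(($2 - 1) / size)   # zero-based bin index of the first end
        b2 = int(($4 - 1) / size)   # zero-based bin index of the second end
        # Key each bin by chromosome and zero-based bin start coordinate.
        count[$1 ":" (b1 * size) "\t" $3 ":" (b2 * size)]++
    }
    END { for (k in count) print k "\t" count[k] }' "$1" | LC_ALL=C sort
}
```

Each output line holds the two bins (chromosome and bin start) followed by the number of contacts that fall into that bin pair.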
.cool to multi-resolution .mcool file
cooler zoomify 129G1_chr19.1000.cool
.hic to .mcool → hic2cool
Software: https://github.com/4dn-dcic/hic2cool
Installation
pip install hic2cool
Usage
hic2cool convert <infile> <outfile> -r <resolution> -p <nproc>
positional arguments:
infile hic input file path
outfile cooler output file path
optional arguments:
-h, --help show this help message and exit
-r RESOLUTION, --resolution RESOLUTION
integer bp resolution desired in cooler file. Setting to 0 (default) will use all resolutions. If all resolutions are
used, a multi-res .cool file will be created, which has a different hdf5 structure. See the README for more info
-p NPROC, --nproc NPROC
number of processes to use to parse hic file. default set to 1
-s, --silent if used, silence standard program output
-w, --warnings if used, print out non-critical WARNING messages, which are hidden by default. Silent mode takes precedence over this
Example
Generate a multi-resolution .mcool file
hic2cool convert ${hic_file} ${mcool_file}
Generate a .cool file at a specific resolution
hic2cool convert ${hic_file} ${cool_50kb_file} -r 50000
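When several single-resolution files are needed, the command above can be repeated per resolution. The sketch below only prints the commands so they can be reviewed first. The function name print_hic2cool_cmds and the output naming scheme (<input>.<resolution>.cool) are assumptions made for this example.

```shell
# Sketch: print one hic2cool command per requested resolution.
print_hic2cool_cmds() {
    # $1: input .hic file; remaining arguments: resolutions in bp
    hic_file=$1
    shift
    for res in "$@"; do
        # ${hic_file%.hic} strips the .hic suffix to build the output name.
        echo "hic2cool convert ${hic_file} ${hic_file%.hic}.${res}.cool -r ${res}"
    done
}

print_hic2cool_cmds sample.hic 10000 50000 100000
```

Piping the output to sh would execute the printed commands. Each requested resolution must already exist in the .hic file.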