前言
Comparing SVs between two genome:use whole-genome alignment
1、MUMmer4
MUMmer is a system for rapidly aligning DNA and protein sequences.
## install
conda create -n mummer4 -c bioconda mummer4
## use nucmer 对齐(out out.delta)
nucmer --maxmatch --noextend -t 24 Q1ref.fa Q8ref.fa
## delta-filter 整理
delta-filter -1 -l 1000 out.delta > out.filter.delta
## show-coords
show-coords -THrd out.filter.delta > out.filter.coords
2.SyRi
安装
conda create -n syri python=3.5
conda activate syri
conda install cython numpy scipy pandas=0.23.4 biopython psutil matplotlib=3.0.0
conda install -c conda-forge python-igraph
conda install -c bioconda pysam
#use chroder
conda install -c bioconda longestrunsubsequence
git clone --single-branch --branch V1.4.1 https://github.com.cnpmjs.org/schneebergerlab/syri.git
cd syri
python3 setup.py install
chmod +x syri/bin/syri syri/bin/chroder syri/bin/plotsr
## syri
python3 syri_path --nosnp -c out.filtered.coords -d out.delta -r ref.fa -q query.fa
## syri plot
python3 syri_plotsr_plot syri.out ref.fa query.fa -H 16 -W 12 -o pdf
## chroder (非必要步骤,query genome没有挂载到染色体时,可用chrorder生成一个)
syri/bin/chroder out.filter.coords Q1ref.fa Q8ref.fa -o out.filter.chroder
参数说明
## syri
--nosnp Set to skip SNP/Indel (within alignment) identification (default: False)
## syri plot
-H 16 height of the plot
-W 12 width of the plot
-o pdf output file format(pdf,png,svg)
3、参数说明
3.1 nucmer参数
Usage: nucmer [options] ref:path qry:path+
nucmer generates nucleotide alignments between two mutli-FASTA input
files. The out.delta output file lists the distance between insertions
and deletions that produce maximal scoring alignments between each
sequence. The show-* utilities know how to read this format.
By default, nucmer uses anchor matches that are unique in in the
reference but not necessarily unique in the query. See --mum and
--maxmatch for different bevahiors.
Options (default value in (), *required):
--mum Use anchor matches that are unique in both the reference and query (false)
--maxmatch Use all anchor matches regardless of their uniqueness (false)(可能包含重复序列的结果)
-b, --breaklen=uint32 Set the distance an alignment extension will attempt to extend poor scoring regions before giving up (200)
-c, --mincluster=uint32 Sets the minimum length of a cluster of matches (65)
-D, --diagdiff=uint32 Set the maximum diagonal difference between two adjacent anchors in a cluster (5)
-d, --diagfactor=double Set the maximum diagonal difference between two adjacent anchors in a cluster as a differential fraction of the gap length (0.12)
--noextend Do not perform cluster extension step (false),设置是否对clustering两端进行延伸。默认设置--extend
-f, --forward Use only the forward strand of the Query sequences (false)
-g, --maxgap=uint32 Set the maximum gap between two adjacent matches in a cluster (90)
-l, --minmatch=uint32 Set the minimum length of a single exact match (20)
-L, --minalign=uint32 Minimum length of an alignment, after clustering and extension (0)
--nooptimize No alignment score optimization, i.e. if an alignment extension reaches the end of a sequence, it will not backtrack to optimize the alignment score and instead terminate the alignment at the end of the sequence (false)
-r, --reverse Use only the reverse complement of the Query sequences (false)
--nosimplify Don't simplify alignments by removing shadowed clusters. Use this option when aligning a sequence to itself to look for repeats (false)
-p, --prefix=PREFIX Write output to PREFIX.delta (out)
--delta=PATH Output delta file to PATH (instead of PREFIX.delta)
--sam-short=PATH Output SAM file to PATH, short format
--sam-long=PATH Output SAM file to PATH, long format
--save=PREFIX Save suffix array to files starting with PREFIX
--load=PREFIX Load suffix array from file starting with PREFIX
--batch=BASES Proceed by batch of chunks of BASES from the reference
-t, --threads=NUM Use NUM threads (# of cores)
-U, --usage Usage
-h, --help This message
--full-help Detailed help
-V, --version Version
nucmer is designed for DNA sequence alignment and proteing is designed for protein or translated sequence alignment
3.2 delta-filter 参数
USAGE: delta-filter [options] <deltafile>
-1 1-to-1 alignment allowing for rearrangements
(intersection of -r and -q alignments)
-g 1-to-1 global alignment not allowing rearrangements
-h Display help information
-i float Set the minimum alignment identity [0, 100], default 0
-l int Set the minimum alignment length, default 0
-m Many-to-many alignment allowing for rearrangements
(union of -r and -q alignments)
-q Maps each position of each query to its best hit in
the reference, allowing for reference overlaps
-r Maps each position of each reference to its best hit
in the query, allowing for query overlaps
-u float Set the minimum alignment uniqueness, i.e. percent of
the alignment matching to unique reference AND query
sequence [0, 100], default 0
-o float Set the maximum alignment overlap for -r and -q options
as a percent of the alignment length [0, 100], default 100
Reads a delta alignment file from either nucmer or promer and
filters the alignments based on the command-line switches, leaving
only the desired alignments which are output to stdout in the same
delta format as the input. For multiple switches, order of operations
is as follows: -i -l -u -q -r -g -m -1. If an alignment is excluded
by a preceding operation, it will be ignored by the succeeding
operations.
An important distinction between the -g option and the -1 and -m
options is that -g requires the alignments to be mutually consistent
in their order, while the -1 and -m options are not required to be
mutually consistent and therefore tolerate translocations,
inversions, etc. In general cases, the -m option is the best choice,
however -1 can be handy for applications such as SNP finding which
require a 1-to-1 mapping. Finally, for mapping query contigs, or
sequencing reads, to a reference genome, use -q.
3.3 show-coords 参数
Convert alignment information to a .TSV format as required by SyRI
USAGE: show-coords [options] <deltafile>
-b Merges overlapping alignments regardless of match dir
or frame and does not display any idenitity information.
-B Switch output to btab format
-c Include percent coverage information in the output
-d Display the alignment direction in the additional
FRM columns (default for promer)
-g Deprecated option. Please use 'delta-filter' instead
-h Display help information
-H Do not print the output header
-I float Set minimum percent identity to display
-k Knockout (do not display) alignments that overlap
another alignment in a different frame by more than 50%
of their length, AND have a smaller percent similarity
or are less than 75% of the size of the other alignment
(promer only)
-l Include the sequence length information in the output
-L long Set minimum alignment length to display
-o Annotate maximal alignments between two sequences, i.e.
overlaps between reference and query sequences
-q Sort output lines by query IDs and coordinates
-r Sort output lines by reference IDs and coordinates
-T Switch output to tab-delimited format
Input is the .delta output of either the "nucmer" or the
"promer" program passed on the command line.
Output is to stdout, and consists of a list of coordinates,
percent identity, and other useful information regarding the
alignment data contained in the .delta file used as input.
NOTE: No sorting is done by default, therefore the alignments
will be ordered as found in the <deltafile> input.
3.4 syri参数说明:
usage
usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B}] [-k] [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR] [--prefix PREFIX] [--seed SEED]
[--nc NCORES] [--novcf] [-f] [--nosr] [--tdgaplen TDGL] [--tdmaxolp TDOLP] [-b BRUTERUNTIME] [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT]
[--inc INCREASEBY] [--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET] [--cigar] [-s SSPATH]
Input Files:
-c INFILE File containing alignment coordinates (default: None)
-r REF Genome A (which is considered as reference for the alignments). Required for local variation (large indels, CNVs) identification. (default: None)
-q QRY Genome B (which is considered as query for the alignments). Required for local variation (large indels, CNVs) identification. (default: None)
-d DELTA .delta file from mummer. Required for short variation (SNPs/indels) identification when CIGAR string is not available (default: None)
optional arguments:
-h, --help show this help message and exit
-F {T,S,B} Input file type. T: Table, S: SAM, B: BAM (default: T)
-k Keep intermediate output files (default: False)
--log {DEBUG,INFO,WARN}
log level (default: INFO)
--lf LOG_FIN Name of log file (default: syri.log)
--dir DIR path to working directory (if not current directory). All files must be in this directory. (default: None)
--prefix PREFIX Prefix to add before the output file Names (default: )
--seed SEED seed for generating random numbers (default: 1)
--nc NCORES number of cores to use in parallel (max is number of chromosomes) (default: 1)
--novcf Do not combine all files into one output file (default: False)
-f Filter out low quality alignments (default: True)
SR identification:
--nosr Set to skip structural rearrangement identification (default: False)
--tdgaplen TDGL Maximum allowed gap-length between two alignments of a multi-alignment translocation or duplication (TD). Larger values increases TD
identification sensitivity but also runtime. (default: 500000)
--tdmaxolp TDOLP Maximum allowed overlap between two translocations. Value should be in range (0,1]. (default: 0.8)
-b BRUTERUNTIME Cutoff to restrict brute force methods to take too much time (in seconds). Smaller values would make algorithm faster, but could have marginal
effects on accuracy. In general case, would not be required. (default: 60)
--unic TRANSUNICOUNT Number of uniques bps for selecting translocation. Smaller values would select smaller TLs better, but may increase time and decrease accuracy.
(default: 1000)
--unip TRANSUNIPERCENT
Percent of unique region requried to select translocation. Value should be in range (0,1]. Smaller values would allow selection of TDs which are
more overlapped with other regions. (default: 0.5)
--inc INCREASEBY Minimum score increase required to add another alignment to translocation cluster solution (default: 1000)
--no-chrmatch Do not allow SyRI to automatically match chromosome ids between the two genomes if they are not equal (default: False)
ShV identification:
--nosv Set to skip structural variation identification (default: False)
--nosnp Set to skip SNP/Indel (within alignment) identification (default: False)
--all Use duplications too for variant identification (default: False)
--allow-offset OFFSET
BPs allowed to overlap (default: 5)
--cigar Find SNPs/indels using CIGAR string. Necessary for alignment generated using aligners other than nucmers (default: False)
-s SSPATH path to show-snps from mummer (default: show-snps)
chroder 参数
usage: chroder [-h] [-n NCOUNT] [-o OUT] [-noref] [-F {T,S,B}] coords ref qry
positional arguments:
coords Alignment coordinates in a tsv format
ref Assembly of genome A in multi-fasta format
qry Assembly of genome B in multi-fasta format
optional arguments:
-h, --help show this help message and exit
-n NCOUNT number of N's to be inserted (default: 500)
-o OUT output file prefix (default: out)
-noref Use this parameter when no assembly is at chromosome level
(default: False)
-F {T,S,B} Input coords type. T: Table, S: SAM, B: BAM (default: T)
参考:
http://mummer.sourceforge.net/manual/
syri://www.greatytc.com/p/3571d7019fb7
syri fileformat:https://schneebergerlab.github.io/syri/fileformat.html
MUMmer://www.greatytc.com/p/c12f2a117892
nucmer:http://www.chenlianfu.com/?p=2559