生信分析|MUMmer4+SyRi

前言

Comparing SVs between two genome：use whole-genome alignment

1、MUMmer4

MUMmer is a system for rapidly aligning DNA and protein sequences.

## install 
conda create -n mummer4 -c bioconda mummer4
## use nucmer 对齐(out out.delta)
nucmer --maxmatch --noextend -t 24 Q1ref.fa Q8ref.fa
## delta-filter 整理
delta-filter -1 -l 1000 out.delta > out.filter.delta
## show-coords
show-coords -THrd out.filter.delta > out.filter.coords

2.SyRi

安装

conda create -n syri python=3.5
conda activate syri
conda install cython numpy scipy pandas=0.23.4 biopython psutil matplotlib=3.0.0
conda install -c conda-forge python-igraph
conda install -c bioconda pysam
#use chroder
conda install -c bioconda longestrunsubsequence 
git clone --single-branch --branch V1.4.1 https://github.com.cnpmjs.org/schneebergerlab/syri.git
cd syri
python3 setup.py install                    
chmod +x syri/bin/syri syri/bin/chroder syri/bin/plotsr

## syri
python3 syri_path --nosnp -c out.filtered.coords -d out.delta -r  ref.fa -q query.fa
## syri plot
python3 syri_plotsr_plot syri.out ref.fa query.fa -H 16 -W 12 -o pdf 
## chroder (非必要步骤，query genome没有挂载到染色体时，可用chrorder生成一个)
syri/bin/chroder out.filter.coords Q1ref.fa Q8ref.fa -o out.filter.chroder

参数说明

## syri
--nosnp  Set to skip SNP/Indel (within alignment) identification (default: False)
## syri plot
-H 16   height of the plot
-W 12  width of the plot
-o pdf   output file format(pdf,png,svg)

3、参数说明

3.1 nucmer参数

Usage: nucmer [options] ref:path qry:path+

nucmer generates nucleotide alignments between two mutli-FASTA input
files. The out.delta output file lists the distance between insertions
and deletions that produce maximal scoring alignments between each
sequence. The show-* utilities know how to read this format.

By default, nucmer uses anchor matches that are unique in in the
reference but not necessarily unique in the query. See --mum and
--maxmatch for different bevahiors.

Options (default value in (), *required):
     --mum                                Use anchor matches that are unique in both the reference and query (false)
     --maxmatch                           Use all anchor matches regardless of their uniqueness (false)（可能包含重复序列的结果）
 -b, --breaklen=uint32                    Set the distance an alignment extension will attempt to extend poor scoring regions before giving up (200)
 -c, --mincluster=uint32                  Sets the minimum length of a cluster of matches (65)
 -D, --diagdiff=uint32                    Set the maximum diagonal difference between two adjacent anchors in a cluster (5)
 -d, --diagfactor=double                  Set the maximum diagonal difference between two adjacent anchors in a cluster as a differential fraction of the gap length (0.12)
     --noextend                           Do not perform cluster extension step (false)，设置是否对clustering两端进行延伸。默认设置--extend
 -f, --forward                            Use only the forward strand of the Query sequences (false)
 -g, --maxgap=uint32                      Set the maximum gap between two adjacent matches in a cluster (90)
 -l, --minmatch=uint32                    Set the minimum length of a single exact match (20)
 -L, --minalign=uint32                    Minimum length of an alignment, after clustering and extension (0)
     --nooptimize                         No alignment score optimization, i.e. if an alignment extension reaches the end of a sequence, it will not backtrack to optimize the alignment score and instead terminate the alignment at the end of the sequence (false)
 -r, --reverse                            Use only the reverse complement of the Query sequences (false)
     --nosimplify                         Don't simplify alignments by removing shadowed clusters. Use this option when aligning a sequence to itself to look for repeats (false)
 -p, --prefix=PREFIX                      Write output to PREFIX.delta (out)
     --delta=PATH                         Output delta file to PATH (instead of PREFIX.delta)
     --sam-short=PATH                     Output SAM file to PATH, short format
     --sam-long=PATH                      Output SAM file to PATH, long format
     --save=PREFIX                        Save suffix array to files starting with PREFIX
     --load=PREFIX                        Load suffix array from file starting with PREFIX
     --batch=BASES                        Proceed by batch of chunks of BASES from the reference
 -t, --threads=NUM                        Use NUM threads (# of cores)
 -U, --usage                              Usage
 -h, --help                               This message
     --full-help                          Detailed help
 -V, --version                            Version

nucmer is designed for DNA sequence alignment and proteing is designed for protein or translated sequence alignment

3.2 delta-filter 参数

USAGE: delta-filter  [options]  <deltafile>

-1            1-to-1 alignment allowing for rearrangements
              (intersection of -r and -q alignments)
-g            1-to-1 global alignment not allowing rearrangements
-h            Display help information
-i float      Set the minimum alignment identity [0, 100], default 0
-l int        Set the minimum alignment length, default 0
-m            Many-to-many alignment allowing for rearrangements
              (union of -r and -q alignments)
-q            Maps each position of each query to its best hit in
              the reference, allowing for reference overlaps
-r            Maps each position of each reference to its best hit
              in the query, allowing for query overlaps
-u float      Set the minimum alignment uniqueness, i.e. percent of
              the alignment matching to unique reference AND query
              sequence [0, 100], default 0
-o float      Set the maximum alignment overlap for -r and -q options
              as a percent of the alignment length [0, 100], default 100

  Reads a delta alignment file from either nucmer or promer and
filters the alignments based on the command-line switches, leaving
only the desired alignments which are output to stdout in the same
delta format as the input. For multiple switches, order of operations
is as follows: -i -l -u -q -r -g -m -1. If an alignment is excluded
by a preceding operation, it will be ignored by the succeeding
operations.
  An important distinction between the -g option and the -1 and -m
options is that -g requires the alignments to be mutually consistent
in their order, while the -1 and -m options are not required to be
mutually consistent and therefore tolerate translocations,
inversions, etc. In general cases, the -m option is the best choice,
however -1 can be handy for applications such as SNP finding which
require a 1-to-1 mapping. Finally, for mapping query contigs, or
sequencing reads, to a reference genome, use -q.

3.3 show-coords 参数

Convert alignment information to a .TSV format as required by SyRI

USAGE: show-coords  [options]  <deltafile>

-b          Merges overlapping alignments regardless of match dir
            or frame and does not display any idenitity information.
-B          Switch output to btab format
-c          Include percent coverage information in the output
-d          Display the alignment direction in the additional
            FRM columns (default for promer)
-g          Deprecated option. Please use 'delta-filter' instead
-h          Display help information
-H          Do not print the output header
-I float    Set minimum percent identity to display
-k          Knockout (do not display) alignments that overlap
            another alignment in a different frame by more than 50%
            of their length, AND have a smaller percent similarity
            or are less than 75% of the size of the other alignment
            (promer only)
-l          Include the sequence length information in the output
-L long     Set minimum alignment length to display
-o          Annotate maximal alignments between two sequences, i.e.
            overlaps between reference and query sequences
-q          Sort output lines by query IDs and coordinates
-r          Sort output lines by reference IDs and coordinates
-T          Switch output to tab-delimited format

  Input is the .delta output of either the "nucmer" or the
"promer" program passed on the command line.
  Output is to stdout, and consists of a list of coordinates,
percent identity, and other useful information regarding the
alignment data contained in the .delta file used as input.
  NOTE: No sorting is done by default, therefore the alignments
will be ordered as found in the <deltafile> input.

3.4 syri参数说明：

usage

usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B}] [-k] [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR] [--prefix PREFIX] [--seed SEED]
            [--nc NCORES] [--novcf] [-f] [--nosr] [--tdgaplen TDGL] [--tdmaxolp TDOLP] [-b BRUTERUNTIME] [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT]
            [--inc INCREASEBY] [--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET] [--cigar] [-s SSPATH]

Input Files:
  -c INFILE             File containing alignment coordinates (default: None)
  -r REF                Genome A (which is considered as reference for the alignments). Required for local variation (large indels, CNVs) identification. (default: None)
  -q QRY                Genome B (which is considered as query for the alignments). Required for local variation (large indels, CNVs) identification. (default: None)
  -d DELTA              .delta file from mummer. Required for short variation (SNPs/indels) identification when CIGAR string is not available (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -F {T,S,B}            Input file type. T: Table, S: SAM, B: BAM (default: T)
  -k                    Keep intermediate output files (default: False)
  --log {DEBUG,INFO,WARN}
                        log level (default: INFO)
  --lf LOG_FIN          Name of log file (default: syri.log)
  --dir DIR             path to working directory (if not current directory). All files must be in this directory. (default: None)
  --prefix PREFIX       Prefix to add before the output file Names (default: )
  --seed SEED           seed for generating random numbers (default: 1)
  --nc NCORES           number of cores to use in parallel (max is number of chromosomes) (default: 1)
  --novcf               Do not combine all files into one output file (default: False)
  -f                    Filter out low quality alignments (default: True)

SR identification:
  --nosr                Set to skip structural rearrangement identification (default: False)
  --tdgaplen TDGL       Maximum allowed gap-length between two alignments of a multi-alignment translocation or duplication (TD). Larger values increases TD
                        identification sensitivity but also runtime. (default: 500000)
  --tdmaxolp TDOLP      Maximum allowed overlap between two translocations. Value should be in range (0,1]. (default: 0.8)
  -b BRUTERUNTIME       Cutoff to restrict brute force methods to take too much time (in seconds). Smaller values would make algorithm faster, but could have marginal
                        effects on accuracy. In general case, would not be required. (default: 60)
  --unic TRANSUNICOUNT  Number of uniques bps for selecting translocation. Smaller values would select smaller TLs better, but may increase time and decrease accuracy.
                        (default: 1000)
  --unip TRANSUNIPERCENT
                        Percent of unique region requried to select translocation. Value should be in range (0,1]. Smaller values would allow selection of TDs which are
                        more overlapped with other regions. (default: 0.5)
  --inc INCREASEBY      Minimum score increase required to add another alignment to translocation cluster solution (default: 1000)
  --no-chrmatch         Do not allow SyRI to automatically match chromosome ids between the two genomes if they are not equal (default: False)

ShV identification:
  --nosv                Set to skip structural variation identification (default: False)
  --nosnp               Set to skip SNP/Indel (within alignment) identification (default: False)
  --all                 Use duplications too for variant identification (default: False)
  --allow-offset OFFSET
                        BPs allowed to overlap (default: 5)
  --cigar               Find SNPs/indels using CIGAR string. Necessary for alignment generated using aligners other than nucmers (default: False)
  -s SSPATH             path to show-snps from mummer (default: show-snps)

chroder 参数

usage: chroder [-h] [-n NCOUNT] [-o OUT] [-noref] [-F {T,S,B}] coords ref qry

positional arguments:
  coords      Alignment coordinates in a tsv format
  ref         Assembly of genome A in multi-fasta format
  qry         Assembly of genome B in multi-fasta format

optional arguments:
  -h, --help  show this help message and exit
  -n NCOUNT   number of N's to be inserted (default: 500)
  -o OUT      output file prefix (default: out)
  -noref      Use this parameter when no assembly is at chromosome level
              (default: False)
  -F {T,S,B}  Input coords type. T: Table, S: SAM, B: BAM (default: T)

参考：

http://mummer.sourceforge.net/manual/
syri：//www.greatytc.com/p/3571d7019fb7
syri fileformat：https://schneebergerlab.github.io/syri/fileformat.html
MUMmer：//www.greatytc.com/p/c12f2a117892
nucmer：http://www.chenlianfu.com/?p=2559

最后编辑于：2022.01.25 11:27:03

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 219,039评论 6赞 508
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 93,426评论 3赞 395
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 165,417评论 0赞 356
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 58,868评论 1赞 295
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 67,892评论 6赞 392
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 51,692评论 1赞 305
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 40,416评论 3赞 419
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 39,326评论 0赞 276
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 45,782评论 1赞 316
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 37,957评论 3赞 337
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 40,102评论 1赞 350
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 35,790评论 5赞 346
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 41,442评论 3赞 331
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 31,996评论 0赞 22
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 33,113评论 1赞 272
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 48,332评论 3赞 373
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 45,044评论 2赞 355