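Removing redundancy
Simple genomes generally do not need redundancy removal after assembly. Longan, cauliflower and papaya, for example, are simple genomes whose assemblies can go straight to the next step. Some genomes, however, contain duplicated sequence, sometimes a great deal of it, and those assemblies need a redundancy-removal step.
A tool I recommend for removing redundancy is khaper: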
https://github.com/lardo/khaper
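Installation
khaper works straight after download once its scripts are made executable; the only dependency you need to install is jellyfish.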
git clone https://github.com/lardo/khaper.git
cd khaper/Bin && chmod 755 *
conda install -c bioconda jellyfish
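Before moving on, it can help to confirm that the required commands are actually on PATH. The helper below is only a sketch; require_tools is a hypothetical name, not part of khaper.

```shell
# require_tools: hypothetical helper (not part of khaper).
# Prints every command that cannot be found on PATH and
# returns non-zero if at least one is missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (khaper needs perl and jellyfish):
#   require_tools perl jellyfish || echo "install the missing tools first"
```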
Usage
1. Prepare input files
Prepare:
assemble.fasta # genome assembly with duplicated sequences.
PE300_1.fq.gz # read1
PE300_2.fq.gz # read2
2. Build the kmer frequency table
ls *.gz > fq.lst
# Most of our genomes are larger than 100M and smaller than 10G, so k=17 is fine
perl Bin/Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17
#result:
kmer bit file: Kmer_17/02.Uinque_bit/kmer_17.bit
Note:
a. k=15 is suitable for genomes smaller than 100M.
b. k=17 is suitable for genomes smaller than 10G.
c. This version only supports k<=17.
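The notes above amount to a simple lookup from estimated genome size to k. A minimal sketch follows, assuming the size is given in base pairs; choose_k is a hypothetical helper, and the behavior for genomes of 10G or more is my own reading of note c, not something the khaper documentation states.

```shell
# choose_k: hypothetical helper (not part of khaper).
# $1 = estimated genome size in bp; prints the k suggested by the notes above.
choose_k() {
    size_bp=$1
    if [ "$size_bp" -lt 100000000 ]; then
        echo 15            # note a: genome < 100M
    elif [ "$size_bp" -lt 10000000000 ]; then
        echo 17            # note b: genome < 10G
    else
        # Assumption: the notes give no k for larger genomes and only k<=17 is supported
        echo "genome >= 10G is not covered by the notes (only k<=17 is supported)" >&2
        return 1
    fi
}

# Example: k=$(choose_k 800000000)
```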
3. Compress the assembly file
# compress the genome
# Usage:
perl remDup.pl <genome.fa> <outdir> <cutoff:0.7>
Options:
--ref <str> The ref genome to build kbit
--kbit <str> The unique kmer file
--kmer <int> the kmer size [15]
--sort <int> sort seq by length [1]
Description
This script is to remove dupplcation seq
# Demo
perl Bin/remDup.pl --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 assemble.fasta Compress 0.3
# result:
compress file: Compress/trinity.single.fasta.gz
A worked example
##Prepare input files
HiFi_path=/share/home/off/Work/Genome_assembly/Sre/01.HiFi
contig=/share/home/off/Work/Genome_assembly/Sre/03.Assembly/01.hifiasm/Sre.asm.hic.p_ctg.fa
##Build the kmer frequency table
ls ${HiFi_path}/*.gz > fq.list
perl ~/biosoft/khaper/Bin/Graph.pl pipe -i fq.list -m 2 -k 17 -s 1,3 -d Kmer_17
##Compress the assembly file
#perl ~/biosoft/khaper/Bin/remDup.pl ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 Compress 0.3
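##0.3 is the cutoff value. The default cutoff is 0.7; we can try 0.6, 0.5 and 0.4 in turn and compare the results
##Compress is the name of the output directory and can be changed freely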
perl ~/biosoft/khaper/Bin/remDup.pl ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.6 0.6
perl ~/biosoft/khaper/Bin/remDup.pl ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.5 0.5
perl ~/biosoft/khaper/Bin/remDup.pl ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.4 0.4
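Since the runs differ only in the cutoff, they could be scripted as a loop. The sketch below assumes the compression script is remDup.pl, called with the argument order shown in the Demo (options first, then genome, output directory and cutoff). run_cutoffs, khaper_bin, kbit and DRY_RUN are names invented for this illustration. By default it only prints the commands, so they can be checked before anything runs.

```shell
# Paths follow the example above; adjust them to your own layout.
khaper_bin=${khaper_bin:-$HOME/biosoft/khaper/Bin}
kbit=${kbit:-Kmer_17/02.Uinque_bit/kmer_17.bit}

# run_cutoffs: hypothetical helper (not part of khaper).
# $1 = contig FASTA, remaining arguments = cutoffs to try.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
run_cutoffs() {
    contig=$1
    shift
    for c in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "perl $khaper_bin/remDup.pl --kbit $kbit --kmer 17 $contig cutoff_$c $c"
        else
            perl "$khaper_bin/remDup.pl" --kbit "$kbit" --kmer 17 "$contig" "cutoff_$c" "$c"
        fi
    done
}

# Example: run_cutoffs "$contig" 0.6 0.5 0.4             # print the commands
#          DRY_RUN=0 run_cutoffs "$contig" 0.6 0.5 0.4   # actually run them
```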
Result files
compress file: cutoff_0.6/trinity.single.fasta.gz
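To compare the outputs of different cutoffs, one quick measure is the total number of bases in each result file. The following is a sketch using only gzip and awk; fasta_total_bp and report_sizes are hypothetical helper names, and the output directories for cutoffs 0.5 and 0.4 are assumed to follow the same naming as the one listed above.

```shell
# fasta_total_bp: hypothetical helper; prints the total sequence length
# of a FASTA file (plain or .gz), ignoring header lines.
fasta_total_bp() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }'
}

# report_sizes: prints "file<TAB>total_bp" for each existing file.
report_sizes() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "skip (not found): $f" >&2
            continue
        fi
        printf '%s\t%s\n' "$f" "$(fasta_total_bp "$f")"
    done
}

# Example (directory names for 0.5 and 0.4 are assumed):
#   report_sizes cutoff_0.6/trinity.single.fasta.gz \
#                cutoff_0.5/trinity.single.fasta.gz \
#                cutoff_0.4/trinity.single.fasta.gz
```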
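So what is the criterion for redundancy removal? The compressed assembly should come down to roughly the estimated genome size, and a follow-up BUSCO assessment should show that the BUSCO score has not dropped much.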
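These two checks can be written down as small numeric tests. The sketch below is only illustrative: size_ok and busco_ok are hypothetical helpers, and the thresholds (10% size deviation, 1 percentage point BUSCO drop) are assumptions of mine, since the text gives no numbers. The BUSCO values are taken from your own BUSCO runs before and after compression.

```shell
# size_ok: succeeds if the assembly size is within a relative tolerance
# of the estimated genome size.
# $1 = assembly size (bp), $2 = estimated size (bp), $3 = tolerance (assumed default 0.1)
size_ok() {
    awk -v a="$1" -v e="$2" -v t="${3:-0.1}" \
        'BEGIN { d = (a - e) / e; if (d < 0) d = -d; exit !(d <= t) }'
}

# busco_ok: succeeds if BUSCO completeness dropped by at most max_drop points.
# $1 = completeness (%) before, $2 = completeness (%) after, $3 = max_drop (assumed default 1)
busco_ok() {
    awk -v b="$1" -v a="$2" -v m="${3:-1}" 'BEGIN { exit !((b - a) <= m) }'
}

# Example with made-up numbers:
#   size_ok 1050000000 1000000000 0.1 && busco_ok 98.5 98.1 1 && echo "cutoff accepted"
#
# A BUSCO run on a compressed assembly might look like this
# (the lineage dataset and output name are placeholders to replace):
#   busco -i cutoff_0.6/assembly.fa -l <lineage_dataset> -m genome -o busco_cutoff_0.6
```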