Removing redundancy

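Simple genomes generally do not need redundancy removal after assembly. Longan, cauliflower, and papaya, for example, are simple genomes, and their assemblies can go straight to the next step. Some genomes, however, contain duplicated sequences, sometimes at a very high level, and those assemblies need a redundancy-removal step.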

One tool I recommend for removing redundancy is khaper.

Installing the software

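The tool works straight after download: clone the repository and make the scripts executable. The only dependency that needs installing is jellyfish.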

git clone https://github.com/lardo/khaper.git
cd khaper/Bin   &&  chmod 755 *
conda install -c bioconda jellyfish
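
Before moving on, it can help to confirm that the required programs are actually on the PATH. The following is a minimal sketch; the helper name check_tools is hypothetical and not part of khaper.

```shell
# Hypothetical helper (not part of khaper): report whether each named
# tool is on PATH, and return non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this workflow the call would be: check_tools perl jellyfish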

Using the software

1.Prepare input files

Prepare:
assemble.fasta  # genome assembly with duplicated sequences.
PE300_1.fq.gz       # read1
PE300_2.fq.gz       # read2

2.Build the kmer frequency table

ls *.gz > fq.lst
# Most genomes we work with are larger than 100M and smaller than 10G, so k=17 is a reasonable setting
perl Bin/Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17
#result:
kmer bit file: Kmer_17/02.Uinque_bit/kmer_17.bit

Note:

a. k=15 is suitable for genome with size <100M.
b. k=17 is suitable for genome with size <10G.
c. This version only supports k<=17.
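
The rule of thumb in the notes above can be written as a small shell function. This is only a sketch: the function name choose_k is hypothetical, and the exact boundaries (100M taken as 100000000 bases, 10G as 10000000000 bases) are my reading of the notes, not something khaper enforces.

```shell
# Hypothetical helper: suggest a k-mer size from an estimated genome
# size given in bases, following the notes above.
# Assumed boundaries: 100M = 100000000, 10G = 10000000000.
choose_k() {
    size=$1
    if [ "$size" -lt 100000000 ]; then
        echo 15
    elif [ "$size" -lt 10000000000 ]; then
        echo 17
    else
        # This version of khaper only supports k<=17.
        echo "genome too large for k<=17" >&2
        return 1
    fi
}
```

For a genome estimated at about 500 Mb, choose_k 500000000 suggests 17, which matches the value used in the command above.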

3.Compress the assembly file

# compress the genome

# Usage:
 perl remDup.pl <genome.fa> <outdir> <cutoff:0.7>

     Options:
          --ref   <str> The ref genome to build kbit
          --kbit  <str> The unique kmer file
          --kmer  <int> the kmer size [15]
          --sort  <int> sort seq by length [1]

Description
     This script removes duplicated sequences

# Demo
perl Bin/remDup.pl  --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 assemble.fasta Compress 0.3

# result:
compress file: Compress/trinity.single.fasta.gz

A concrete example

##Prepare input files
HiFi_path=/share/home/off/Work/Genome_assembly/Sre/01.HiFi
contig=/share/home/off/Work/Genome_assembly/Sre/03.Assembly/01.hifiasm/Sre.asm.hic.p_ctg.fa
##Build the kmer frequency table
ls ${HiFi_path}/*.gz > fq.list
perl ~/biosoft/khaper/Bin/Graph.pl pipe -i fq.list -m 2 -k 17 -s 1,3 -d Kmer_17
##Compress the assembly file
#perl ~/biosoft/khaper/Bin/remDup.pl ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 Compress 0.3
## 0.3 is the cutoff value; the default cutoff is 0.7. We can try 0.6, 0.5, and 0.4 in turn and then compare the results
## Compress is the name of the output directory and can be changed freely
perl ~/biosoft/khaper/Bin/remDup.pl  ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.6 0.6
perl ~/biosoft/khaper/Bin/remDup.pl  ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.5 0.5
perl ~/biosoft/khaper/Bin/remDup.pl  ${contig} --kbit Kmer_17/02.Uinque_bit/kmer_17.bit --kmer 17 cutoff_0.4 0.4
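
Trying several cutoffs is easier with a loop. The sketch below only prints one remDup.pl command per cutoff, following the usage shown in step 3, so the commands can be reviewed before anything runs. The function name print_remdup_cmds and the KHAPER_BIN variable are hypothetical, and the default install path is simply the one used in this example.

```shell
# Hypothetical helper: print (do not run) one remDup.pl command per cutoff.
# KHAPER_BIN is an assumed variable pointing at khaper's Bin directory;
# it falls back to the install path used in this example.
print_remdup_cmds() {
    contig=$1   # assembly FASTA
    kbit=$2     # unique k-mer bit file from Graph.pl
    shift 2     # remaining arguments are the cutoffs to try
    for cutoff in "$@"; do
        echo "perl ${KHAPER_BIN:-$HOME/biosoft/khaper/Bin}/remDup.pl --kbit $kbit --kmer 17 $contig cutoff_$cutoff $cutoff"
    done
}
```

Called as print_remdup_cmds "${contig}" Kmer_17/02.Uinque_bit/kmer_17.bit 0.6 0.5 0.4, it lists three commands, each writing to its own cutoff_* directory.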

Result files

compress file: cutoff_0.6/trinity.single.fasta.gz

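So what is the criterion for how much redundancy to remove? The compressed assembly should come down to roughly the estimated genome size, and a follow-up BUSCO evaluation should show that the completeness score has not dropped much.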
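
The size check can be done with standard tools. Below is a minimal sketch that counts the bases in a FASTA file (plain or gzip-compressed) and divides by the estimated genome size. The function names assembly_size and size_ratio are hypothetical, and the estimated size has to come from your own estimate, for example a k-mer survey.

```shell
# Hypothetical helper: print the total number of bases in a FASTA file.
# Files ending in .gz are decompressed on the fly; header lines are skipped.
assembly_size() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }'
}

# Hypothetical helper: print assembly size divided by the estimated
# genome size (both in bases), rounded to two decimals.
size_ratio() {
    awk -v a="$(assembly_size "$1")" -v e="$2" 'BEGIN { printf "%.2f\n", a / e }'
}
```

Running size_ratio on the result file of each cutoff directory, with the same estimated size, gives one ratio per cutoff. A ratio close to 1 suggests the cutoff is in the right range, and BUSCO then confirms that completeness has held up.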

Reference link

https://github.com/lardo/khaper

Genome redundancy removal (Part 1)