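1. Introduction
Flye (formerly ABruijn) is a de novo assembler for third-generation, single-molecule sequencing reads such as those produced by PacBio and Oxford Nanopore Technologies (ONT). It handles long-read datasets of any scale, from small bacterial genomes to large mammalian ones, and includes a dedicated mode for metagenome assembly.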
2. Download and installation
Install with conda from the bioconda channel:
conda install -c conda-forge -c bioconda flye
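3. Parameters in detail
Input data
Input reads may be FASTA or FASTQ files, either uncompressed or compressed.
Both raw and error-corrected reads from PacBio and ONT are supported. The expected error rate is below 30% for raw reads and below 2% for corrected reads.
Besides PacBio and Oxford Nanopore data, Flye can also build a consensus of multiple contig sets.
An estimated genome size must be provided. An estimate within 0.5x to 2x of the true size is sufficient; it is used for k-mer selection.
Polishing iterations: one iteration is run by default. Additional iterations may correct a small number of extra errors.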
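Because the genome size only needs to be right to within a factor of two, a rough figure is enough. A related number worth checking before assembly is read coverage: total read bases divided by the estimated genome size. The sketch below shows one way to compute it for an uncompressed FASTA file with awk. The function names fasta_total_bases and estimate_coverage are hypothetical helpers, not part of Flye.

```shell
# Hypothetical helpers (not part of Flye): estimate read coverage from an
# uncompressed FASTA file and a genome size given in base pairs.

# Print the total number of sequence bases in a FASTA file.
fasta_total_bases() {
    # Skip header lines (starting with ">") and sum the lengths of the rest.
    awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }' "$1"
}

# Print coverage = total bases / genome size, with one decimal place.
estimate_coverage() {
    local fasta="$1" genome_size_bp="$2"
    local bases
    bases=$(fasta_total_bases "$fasta")
    awk -v b="$bases" -v g="$genome_size_bp" 'BEGIN { printf "%.1f\n", b / g }'
}
```

For a 4.6 Mb genome, calling estimate_coverage reads.fasta 4600000 prints the approximate coverage of reads.fasta (a placeholder file name). Compressed or FASTQ input would need to be converted first, since this sketch counts every non-header line as sequence.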
Input parameters
Read type and file location(s):
--pacbio-raw path [path ...]
--pacbio-corr path [path ...]
--nano-raw path [path ...]
--nano-corr path [path ...]
--subassemblies path [path ...]
Estimated genome size: -g size, --genome-size size
Number of threads: -t int, --threads int
Number of polishing iterations: -i int, --iterations int
Minimum overlap between reads: -m int, --min-overlap int
Reduced coverage for initial disjointig assembly: --asm-coverage int. This lowers memory consumption; 30x coverage is enough to produce good disjointigs.
Rescue short unassembled plasmids: --plasmids
Metagenome / uneven coverage mode: --meta
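The options above combine into a single command line. As a minimal sketch, the hypothetical wrapper below (build_flye_cmd is not part of Flye) prints the command instead of running it, so the flags can be reviewed first. It uses only options listed in this section; the file names and values in the usage note are placeholders.

```shell
# Hypothetical dry-run helper (not part of Flye): print a flye command line
# assembled from the options described above, without executing it.
# Usage: build_flye_cmd <read-type-flag> <reads> <genome-size> <out-dir> [threads] [iterations]
build_flye_cmd() {
    local read_flag="$1" reads="$2" genome_size="$3" out_dir="$4"
    # Fall back to one thread and one polishing iteration when not given.
    local threads="${5:-1}" iterations="${6:-1}"

    # Accept only the read-type flags listed in this section.
    case "$read_flag" in
        --pacbio-raw|--pacbio-corr|--nano-raw|--nano-corr|--subassemblies) ;;
        *) echo "unknown read type flag: $read_flag" >&2; return 1 ;;
    esac

    echo "flye $read_flag $reads --genome-size $genome_size --out-dir $out_dir --threads $threads --iterations $iterations"
}
```

For example, build_flye_cmd --nano-raw reads.fastq.gz 5m out_nano 8 2 prints a command for raw Nanopore reads with 8 threads and two polishing iterations. Flags such as --meta or --asm-coverage 30 can be appended to the printed command by hand.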
4. Example
mkdir E.coli_PacBio_40x
cd E.coli_PacBio_40x
# Download the data
wget \
-O E.coli.fasta \
https://zenodo.org/record/1172816/files/E.coli_PacBio_40x.fasta
# Assemble
flye \
--pacbio-raw E.coli.fasta \
--out-dir out_pacbio \
--genome-size 5m
Requirements: more than 2 GB of RAM; genome size: 4.6 Mb; run time: 2 h
5. Output
assembly.fasta - Final assembly. Contains contigs and possibly scaffolds (see below).
assembly_graph.{gfa|gv} - Final repeat graph. Note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges (see below).
assembly_info.txt - Extra information about contigs (such as length, ID, and coverage).
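As a sketch of how assembly_info.txt might be inspected, the awk-based helper below counts contigs, totals their lengths, and reports the longest one. It assumes a tab-separated file whose header line starts with "#" and whose first three columns are contig name, length, and coverage. That layout is an assumption and should be checked against the file your Flye version writes. The name summarize_assembly_info is hypothetical.

```shell
# Hypothetical helper (not part of Flye): summarize a contig table.
# Assumed layout: tab-separated, header starting with "#",
# columns 1-3 = contig name, length, coverage.
summarize_assembly_info() {
    awk -F'\t' '
        /^#/ || NF < 3 { next }   # skip the header and incomplete lines
        {
            len = $2 + 0          # force numeric interpretation of the length
            n++
            total += len
            if (len > max_len) { max_len = len; longest = $1; cov = $3 }
        }
        END {
            printf "contigs\t%d\n", n
            printf "total_length\t%.0f\n", total
            printf "longest\t%s\t%.0f\t%s\n", longest, max_len, cov
        }
    ' "$1"
}
```

Running summarize_assembly_info out_pacbio/assembly_info.txt after the example in section 4 prints three tab-separated lines: the contig count, the total assembled length, and the name, length, and coverage of the longest contig.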