一、NCBI Taxonomy 数据文件
NCBI Taxonomy数据库将所有生物的分类学关系组织为一棵“有根树”(rooted tree), 与进化树(Phylogenetic tree)不同: 进化树是按进化关系”组织,且可以为“无根树”(unrooted tree)。
NCBI Taxonomy公开数据格式有两种,旧的名称为 taxdump.tar.gz
,文件大小约50Mb,内含以下文件。
nodes.dmp [当前版本] 节点信息
重要内容: tax_id, parent tax_id, rank
names.dmp [当前版本] 名称信息
重要内容: tax_id, name_txt
merged.dmp [目前为止] 被合并的节点信息
重要内容: old_tax_id, new_tax_id
delnodes.dmp [目前为止] 被删除的nodes信息
重要内容: tax_id
citations.dmp 引用信息
division.dmp division信息
gencode.dmp 遗传编码信息
gc.prt 遗传编码表
readme.txt 说明文档
其中最主要的是前4个文件:
-
nodes.dmp
主要包含当前版本的所有分类学单元节点(taxon) 的唯一标识符(taxonomic identifier, 简称TaxId, taxid, tax_id), 分类学水平(rank),及其父节点的TaxID。 -
names.dmp
主要包含包含当前版本的所有TaxID及 其统一科学名称(scientific name)和别名。 -
merged.dmp
包含了到当前版本为止,所有被合并的TaxID与合并到的新TaxID。 -
delnodes.dmp
包含了到当前版本为止,所有被删除的TaxID。
在2018年2月的时候,推出了新的格式, 额外包含了谱系(lineage),类型(type)和宿主(host)信息。 文件名称为new_taxdump.tar.gz
,文件大小约110Mb。 相对旧版,新版本文件数量和内容更多,主要是因为增加了lineage和类型信息。 事实上lineage是可以从nodes.dmp
和names.dmp
计算而来。 新版格式所含文件如下:
nodes.dmp
names.dmp
merged.dmp
delnodes.dmp
fullnamelineage.dmp
TaxIDlineage.dmp
rankedlineage.dmp
host.dmp
typeoftype.dmp
typematerial.dmp
citations.dmp
division.dmp
gencode.dmp
readme.txt
NCBI Taxonomy数据每天都在更新,每月初(大多为1号)的数据作为存档保存在 taxdump_archive/
目录, 旧版本最早数据到2014年8月,新版本只到2018年12月。
二、TaxonKit
TaxonKit为命令行工具,采用子命令的方式来执行不同功能, 大多数子命令支持标准输入/输出,便于使用命令行管道进行流水作业, 轻松整合进分析流程中。
子命令 | 功能 |
---|---|
list |
列出指定TaxId下所有子单元的的TaxID |
lineage |
根据TaxID获取完整谱系(lineage) |
reformat |
将完整谱系转化为“界门纲目科属种株"的自定义格式 |
name2taxid |
将分类单元名称转化为TaxID |
filter |
按分类学水平范围过滤TaxIDs |
lca |
计算最低公共祖先(LCA) |
taxid-changelog |
追踪TaxID变更记录 |
version |
显示版本信息、检测新版本 |
genautocomplete |
生成shell自动补全配置脚本 |
-
输出:
- 所有命令输出中包含输入数据内容,在此基础上增加列。
- 所有命令默认输出到标准输出(stdout),可通过重定向(
>
)写入文件。 - 或通过全局参数
-o
或--out-file
指定输出文件,且可自动识别输出文件后缀(.gz
)输出gzip格式。
-
输入:
- 除了
list
与taxid-changelog
之外,lineage
,reformat
,name2taxid
,filter
与lca
均可从标准输入(stdin)读取输入数据,也可通过位置参数(positional arguments)输入,即命令后面不带 任何flag的参数,如taxonkit lineage taxids.txt
- 输入格式为单列,或者制表符分隔的格式,输入数据所在列用
-i
或--taxid-field
指定。
- 除了
TaxonKit直接解析NCBI Taxonomy数据文件(2秒左右),配置更容易,也便于更新数据,占用内存在500Mb-1.5G左右。 数据下载:
# 有时下载失败,可多试几次;或尝试浏览器下载此链接
wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz
# 解压文件存于家目录中.taxonkit/,程序默认数据库默认目录
mkdir -p $HOME/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit
1. list
列出指定TaxId所在子树的所有TaxID
taxonkit list
用于列出指定TaxID所在分类学单元(taxon)的子树(subtree)的所有taxon的TaxID,可选显示名称和分类学水平。 此功能与NCBI Taxonomy网页版类似。
如,以人属(9605) 和肠道中著名的Akk菌属(239934)为例
$ taxonkit list --show-rank --show-name --indent " " --ids 9605,239934
9605 [genus] Homo
9606 [species] Homo sapiens
63221 [subspecies] Homo sapiens neanderthalensis
741158 [subspecies] Homo sapiens subsp. 'Denisova'
1425170 [species] Homo heidelbergensis
2665952 [no rank] environmental samples
2665953 [species] Homo sapiens environmental sample
239934 [genus] Akkermansia
239935 [species] Akkermansia muciniphila
349741 [strain] Akkermansia muciniphila ATCC BAA-835
512293 [no rank] environmental samples
512294 [species] uncultured Akkermansia sp.
1131822 [species] uncultured Akkermansia sp. SMG25
1262691 [species] Akkermansia sp. CAG:344
1263034 [species] Akkermansia muciniphila CAG:154
1679444 [species] Akkermansia glycaniphila
2608915 [no rank] unclassified Akkermansia
1131336 [species] Akkermansia sp. KLE1605
...
list使用最广泛的的功能是获取某个类别(比如细菌、病毒、某个属等)下所有的TaxID, 用来从NCBI nt/nr中获取对应的核酸/蛋白序列,从而搭建特异性的BLAST数据库。 官网提供了相应的详细步骤: http://bioinf.shenwei.me/taxonkit/tutorial 。
所有细菌的 TaxID
$ taxonkit list --show-rank --show-name --ids 2 > /dev/null
2. lineage
根据TaxID获取完整谱系
分类学数据相关最常见的功能就是根据TaxID获取完整谱系。 TaxonKit可根据输入文件提供的TaxID列表快速计算lineage,并可选提供名称,分类学水平,以及谱系对应的TaxID。
值得注意的是,随着Taxonomy数据的频繁更新,有的TaxID可能被删除、或合并(merge)到其它TaxID中, TaxonKit会自动识别,并进行提示,对于被合并的TaxID,TaxonKit会按新TaxID进行计算。
# 使用example中的测试数据
$ head taxids.txt
9606
9913
376619
# 查找指定taxids列表的物种信息,tee可输出屏幕并写入文件
$ taxonkit lineage taxids.txt | tee lineage.txt
19:22:13.077 [WARN] taxid 92489 was merged into 796334
19:22:13.077 [WARN] taxid 1458427 was merged into 1458425
19:22:13.077 [WARN] taxid 123124124 not found
19:22:13.077 [WARN] taxid 3 was deleted
9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS
349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
123124124
3
92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae
1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei
与其它软件的性能相比,当查询数量较少时ETE较快,数量较多时则TaxonKit更快。 在不同数据量规模上 TaxonKit速度一直很稳定,均为2-3秒,时间主要花在解析Taxonomy数据文件上。
列出lineage每个分类学单元的的TaxId和rank和名称,比如SARS-COV-2。
# lineage提取SARS-COV-2的世系
$ echo "2697049" \
| taxonkit lineage -t -R \
| sed "s/\t/\n/g"
2697049
Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049
superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank
3. reformat
生成标准层级物种注释
有时候,我们并不需要完整的分类学谱系(complete lineage),因为很多级别即不常用,而且不完整。通常只想保留界门纲目科属种。
值得注意的是,不是所有物种都有完整的界门纲目科属种水平,特别是病毒以及一些环境样品。 TaxonKit可以用自定义内容替代缺失的分类单元,如用“__”替代。 更厉害有用的是,TaxonKit还可以用更高层级的分类单元信息来补齐缺失的层级 (-F/--fill-miss-rank
),比如
# 没有genus的病毒
$ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y
# -F参数会用family信息来补齐genus信息
$ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y
输出格式可选只输出部分分类学水平,还支持制表符("\t"
),再配合作者的另一个工具csvtk,可以输出漂亮的结果。
其它有用的选项:
-
-P/--add-prefix
:给每个分类学水平添加前缀,比如s__species
。 -
-t/--show-lineage-taxids
:输出分类学单元对应的TaxID。 -
-r/--miss-rank-repl
: 替代没有对应rank的taxon名称 -
-S/--pseudo-strain
: 对于低于species且rank既不是subspecies也不是strain的taxid,使用水平最低taxon名称做为菌株名称。
例,
$ echo -ne "349741\n1327037"\
| taxonkit lineage \
| taxonkit reformat -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F -P \
| csvtk cut -t -f -2 \
| csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \
| csvtk pretty -t
taxid kindom phylum class order family genus species
349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila
1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y
# 便于小屏幕查看,用csvtk进行转置
$ echo -ne "349741\n1327037"\
| taxonkit lineage \
| taxonkit reformat -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F -P \
| csvtk cut -t -f -2 \
| csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \
| csvtk transpose -t \
| csvtk pretty -H -t
taxid 349741 1327037
kindom k__Bacteria k__Viruses
phylum p__Verrucomicrobia p__Uroviricota
class c__Verrucomicrobiae c__Caudoviricetes
order o__Verrucomicrobiales o__Caudovirales
family f__Akkermansiaceae f__Siphoviridae
genus g__Akkermansia g__unclassified Siphoviridae genus
species s__Akkermansia muciniphila s__Croceibacter phage P2559Y
# 到株水平,以sars-cov-2为例
$ echo -ne "2697049"\
| taxonkit lineage \
| taxonkit reformat -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" -F -P -S \
| csvtk cut -t -f -2 \
| csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species,strain \
| csvtk transpose -t \
| csvtk pretty -H -t
taxid 2697049
kindom k__Viruses
phylum p__Pisuviricota
class c__Pisoniviricetes
order o__Nidovirales
family f__Coronaviridae
genus g__Betacoronavirus
species s__Severe acute respiratory syndrome-related coronavirus
strain t__Severe acute respiratory syndrome coronavirus 2
4. name2taxid
将分类单元名称转化为TaxID
将分类单元名称转化为TaxID非常容易理解,唯一要注意的是 某些TaxId对应相同的名称,比如
# -i指定列,-r显示级别,-L不显示世系
$ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L
Drosophila 7215 genus
Drosophila 32281 subgenus
Drosophila 2081351 genus` </pre>
获取TaxID之后,可以立即传给taxonkit进行后续操作,但要注意用-i
指定TaxId所在列。
5. filter
按分类学水平范围过滤TaxIDs
filter可以按分类学水平范围过滤TaxIDs,注意,不仅仅是特定的Rank,而是一个范围。 比如genus及以下的分类学水平,用-L genus -E genus
,类似于 <= genus
。
$ cat taxids2.txt \
| taxonkit filter -L genus -E genus \
| taxonkit lineage -r -n -L \
| csvtk -Ht cut -f 1,3,2 \
| csvtk pretty -H -t
239934 genus Akkermansia
239935 species Akkermansia muciniphila
349741 strain Akkermansia muciniphila ATCC BAA-835
6. lca
计算最低公共祖先(LCA)
比如人属的例子
$ taxonkit list --ids 9605 -nr --indent " "
9605 [genus] Homo
9606 [species] Homo sapiens
63221 [subspecies] Homo sapiens neanderthalensis
741158 [subspecies] Homo sapiens subsp. 'Denisova'
1425170 [species] Homo heidelbergensis
2665952 [no rank] environmental samples
2665953 [species] Homo sapiens environmental sample
TaxID的分隔符可用-s/--separater
指定,默认为" "。
# 计算两个物种的最近共同祖先,以上面尼安德特人亚种和海德堡人种
$ echo 63221 2665953 | taxonkit lca
63221 2665953 9605
# 其它分隔符,且不小心多了空格
$ echo -ne "a\t63221,2665953\nb\t63221, 741158\n"
a 63221,2665953
b 63221, 741158
$ echo -ne "a\t63221,2665953\nb\t63221, 741158\n" \
| taxonkit lca -i 2 -s ","
a 63221,2665953 9605
b 63221, 741158 9606
7. TaxID-changelog
追踪TaxID变更记录
NCBI Taxonomy数据每天都在更新,每月初(大多为1号)的数据作为存档保存在 taxdump_archive/
目录, 旧版本最早数据到2014年8月,新版本只到2018年12月。
TaxonKit可以追踪所有TaxID每个月的变化,输出到csv文件中,可以通过命令行工具进行查询。 数据和文档单独托管在 https://github.com/shenwei356/taxid-changelog 。
除了简单的增加、删除、合并之外,作者将TaxID改变做了细分。输出格式如下
# 列 备注
taxid # taxid
version # version / time of archive, e.g, 2019-07-01
change # change, values:
# NEW 新增
# REUSE_DEL 前期被删除,现在又重新加入
# REUSE_MER 前期被合并,现在又重新加入
# DELETE 删除
# MERGE 合并到另一个TaxID
# ABSORB 其他TaxID合并到当前TaxID
# CHANGE_NAME 名称改变
# CHANGE_RANK 分类学水平改变
# CHANGE_LIN_LIN 谱系的TaxID没有变化,谱系改变(某些TaxID的名称变了)
# CHANGE_LIN_TAX 谱系的TaxID改变
# CHANGE_LIN_LEN 谱系的长度/深度发生变化
change-value # variable values for changes:
# 1) new taxid for MERGE
# 2) merged taxids for ABSORB
# 3) empty for others
name # scientific name
rank # rank
lineage # complete lineage of the taxid
lineage-taxids # taxids of the lineage
数据文件可以在前面网站上下载,taxid-changelog.csv.gz
,130M左右,解压后2.2G,因为是gzip格式,完全不需要解压即可分析。 下文使用了pigz
代替zcat
和gzip
提高解压速度。
例1 superkingdom也能消失 ,比如类病毒(Viroids)在2019年5月被删除了。 作者是在某一天无意中发现此事,所以决定刨根问底,开发了这个子命令。
# 下载
wget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz
# 安装多线程解压索软件。或者用gzip替换。
conda install pigz
$ pigz -cd taxid-changelog.csv.gz \
| csvtk grep -f rank -p superkingdom \
| csvtk pretty
taxid version change change-value name rank lineage lineage-taxids
2 2014-08-01 NEW Bacteria superkingdom cellular organisms;Bacteria 131567;2
2157 2014-08-01 NEW Archaea superkingdom cellular organisms;Archaea 131567;2157
2759 2014-08-01 NEW Eukaryota superkingdom cellular organisms;Eukaryota 131567;2759
10239 2014-08-01 NEW Viruses superkingdom Viruses 10239
12884 2014-08-01 NEW Viroids superkingdom Viroids 12884
12884 2019-05-01 DELETE Viroids superkingdom Viroids 12884
例2 SARS-CoV-2 。可见新冠病毒在2020年2月加入,随后3月和6月份改了名称,谱系等信息。查询速度也很快。
# 本例子只显示了部分列。
$ time pigz -cd taxid-changelog.csv.gz \
| csvtk grep -f taxid -p 2697049 \
| csvtk cut -f version,change,name,rank \
| csvtk pretty
version change name rank
2020-02-01 NEW Wuhan seafood market pneumonia virus species
2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank
2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank
2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank
2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank
2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate
2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank
real 0m7.644s
user 0m16.749s
sys 0m3.985s
三、Tutorial
Table of Contents
- Formatting lineage
- Parsing kraken/bracken result
- Making nr blastdb for specific taxids
- Summaries of taxonomy data
1.Formatting lineage
$ echo "2697049" \
| taxonkit lineage -t \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
10239 superkingdom Viruses
2559587 clade Riboviria
2732396 kingdom Orthornavirae
2732408 phylum Pisuviricota
2732506 class Pisoniviricetes
76804 order Nidovirales
2499399 suborder Cornidovirineae
11118 family Coronaviridae
2501931 subfamily Orthocoronavirinae
694002 genus Betacoronavirus
2509511 subgenus Sarbecovirus
694009 species Severe acute respiratory syndrome-related coronavirus
2697049 no rank Severe acute respiratory syndrome coronavirus 2
Example data.
$ cat taxids3.txt
376619
349741
239935
314101
11932
1327037
83333
1408252
2605619
2697049
Format to 7-level ranks ("superkingdom phylum class order family genus species").
$ cat taxids3.txt \
| taxonkit reformat -I 1
376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis
349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B
11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y
83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus
Format to 8-level ranks ("superkingdom phylum class order family genus species subspecies/rank").
$ cat taxids3.txt \
| taxonkit reformat -I 1 -f "{k};{p};{c};{o};{f};{g};{s};{t}"
376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS
349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;
314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B;
11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle;
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y;
83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12
1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178
2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus;
Replace missing ranks with Unassigned
and output tab-delimited format.
$ cat taxids3.txt \
| taxonkit reformat -I 1 -r "Unassigned" -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
| csvtk pretty -H -t
376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. holarctica LVS
349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835
239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned
314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned
11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned
1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Unassigned Croceibacter phage P2559Y Unassigned
83333 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12
1408252 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178
2605619 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned
2697049 Viruses Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Severe acute respiratory syndrome-related coronavirus Unassigned
Fill missing ranks and add prefixes.
$ cat taxids3.txt \
| taxonkit reformat -I 1 -F -P -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
| csvtk pretty -H -t
376619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. holarctica LVS
349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835
239935 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain
314101 k__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932 k__Viruses p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain
1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain
83333 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12
1408252 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178
2605619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain
2697049 k__Viruses p__Pisuviricota c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Severe acute respiratory syndrome-related coronavirus t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain
When these's no nodes of rank "subspecies" nor "strain", we can switch -S/--pseudo-strain
to use the node with lowest rank as subspecies/strain name, if which rank is lower than "species".
$ cat taxids3.txt \
| taxonkit lineage -r -L \
| taxonkit reformat -I 1 -F -S -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
| cut -f 1,2,9,10 \
| csvtk add-header -t -n "taxid,rank,species,strain" \
| csvtk pretty -t
taxid rank species strain
------- ---------- ----------------------------------------------------- ------------------------------------------------------------------------------
376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS
349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835
239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain
314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain
1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain
83333 strain Escherichia coli Escherichia coli K-12
1408252 subspecies Escherichia coli Escherichia coli R178
2605619 no rank Escherichia coli Escherichia coli O16:H48
2697049 no rank Severe acute respiratory syndrome-related coronavirus Severe acute respiratory syndrome coronavirus 2
List eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with "no rank". But when filtering with -L/--lower-than
, you can use -n/--save-predictable-norank
to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff.
$ time taxonkit list --ids 1 \
| taxonkit filter -L species -E species -R -N -n \
| taxonkit lineage -n -r -L \
| taxonkit reformat -I 1 -F -S -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
| csvtk cut -Ht -l -f 1,3,2,1,4-11 \
| csvtk add-header -t -n "taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain" \
| pigz -c > result.tsv.gz
real 0m25.167s
user 2m14.809s
sys 0m7.197s
$ pigz -cd result.tsv.gz \
| csvtk grep -t -f taxid -p 2697049 \
| csvtk transpose -t \
| csvtk pretty -H -t
taxid 2697049
rank no rank
name Severe acute respiratory syndrome coronavirus 2
lineage Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2
kingdom Viruses
phylum Pisuviricota
class Pisoniviricetes
order Nidovirales
family Coronaviridae
genus Betacoronavirus
species Severe acute respiratory syndrome-related coronavirus
strain Severe acute respiratory syndrome coronavirus 2
2.Parsing kraken/bracken result
Example Data
Run Kraken2 and Bracken
KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf
THREADS=16
CLASSIFICATION_LVL=S
THRESHOLD=10
READ_LEN=100
SAMPLE=SRS014459-Stool.fasta.gz
BRACKEN_OUTPUT_FILE=$SAMPLE
kraken2 --db ${KRAKEN_DB} --threads ${THREADS} -report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken
est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \
-l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken
Orignial format
$ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport
100.00 9491 0 R 1 root
99.85 9477 0 R1 131567 cellular organisms
99.85 9477 0 D 2 Bacteria
66.08 6271 0 D1 1783270 FCB group
66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group
66.08 6271 0 P 976 Bacteroidetes
66.08 6271 0 C 200643 Bacteroidia
66.08 6271 0 O 171549 Bacteroidales
34.45 3270 0 F 815 Bacteroidaceae
34.45 3270 0 G 816 Bacteroides
10.43 990 990 S 246787 Bacteroides cellulosilyticus
7.98 757 757 S 28116 Bacteroides ovatus
3.10 293 0 G1 2646097 unclassified Bacteroides
1.06 100 100 S 2755405 Bacteroides sp. CACC 737
0.49 46 46 S 2650157 Bacteroides sp. HF-5287
Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py)
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 5,1 \
| taxonkit lineage \
| taxonkit reformat -i 3 -P -f "{k}|{p}|{c}|{o}|{f}|{g}|{s}" \
| csvtk cut -Ht -f 4,2 \
| csvtk replace -Ht -p "(\|[kpcofgs]__)+$" \
| csvtk replace -Ht -p "\|[kpcofgs]__\|" -r "|" \
| csvtk uniq -Ht \
| csvtk grep -Ht -p k__ -v \
> SRS014459-Stool.fasta.gz_bracken_species.kreport.format
$ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format
k__Bacteria 99.85
k__Bacteria|p__Bacteroidetes 66.08
k__Bacteria|p__Bacteroidetes|c__Bacteroidia 66.08
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 66.08
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. HF-5287 0.49
Converting to Qiime format
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 5,1 \
| taxonkit lineage \
| taxonkit reformat -i 3 -P -f "{k}; {p}; {c}; {o}; {f}; {g}; {s}" \
| csvtk cut -Ht -f 4,2 \
| csvtk replace -Ht -p "(; [kpcofgs]__)+$" \
| csvtk replace -Ht -p "; [kpcofgs]__; " -r "; " \
| csvtk uniq -Ht \
| csvtk grep -Ht -p k__ -v \
| head -n 10
k__Bacteria 99.85
k__Bacteria; p__Bacteroidetes 66.08
k__Bacteria; p__Bacteroidetes; c__Bacteroidia 66.08
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287 0.49` </pre>
Save taxon proportion and taxid, and get lineage, name and rank.
<pre id="__code_13" style="box-sizing: inherit; color: var(--md-code-fg-color); font-feature-settings: "kern"; font-family: var(--md-code-font-family); margin-bottom: 1em; margin-top: 1em; direction: ltr; display: flow-root; line-height: 1.4; position: relative;">`$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 1,5 \
| taxonkit lineage -i 2 -n -r \
| csvtk cut -Ht -f 1,2,5,4,3 \
| head -n 10 \
| csvtk pretty -Ht
100.00 1 no rank root root
99.85 131567 no rank cellular organisms cellular organisms
99.85 2 superkingdom Bacteria cellular organisms;Bacteria
66.08 1783270 clade FCB group cellular organisms;Bacteria;FCB group
66.08 68336 clade Bacteroidetes/Chlorobi group cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group
66.08 976 phylum Bacteroidetes cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes
66.08 200643 class Bacteroidia cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia
66.08 171549 order Bacteroidales cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales
34.45 815 family Bacteroidaceae cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae
34.45 816 genus Bacteroides cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides
Only save species or lower level and get lineage in format of "superkingdom phylum class order family genus species".
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 1,5 \
| taxonkit filter -N -E species -L species -i 2 \
| taxonkit lineage -i 2 -n -r \
| taxonkit reformat -i 3 -f "{k};{p};{c};{o};{f};{g};{s}" \
| csvtk cut -Ht -f 1,2,5,4,6 \
| csvtk add-header -t -n abundance,taxid,rank,name,lineage \
| head -n 10 \
| csvtk pretty -t
abundance taxid rank name lineage
--------- ------- ------- ---------------------------- --------------------------------------------------------------------------------------------------------
10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus
7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus
1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737
0.49 2650157 species Bacteroides sp. HF-5287 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287
0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1
0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10
0.16 2650158 species Bacteroides sp. HF-5141 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141
0.12 2715212 species Bacteroides sp. CBA7301 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CBA7301
5.10 817 species Bacteroides fragilis Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis
3.Making nr blastdb for specific taxids
Attention:
- BLAST+ 2.8.1 is released with new databases, which allows you to limit your search by taxonomy using information built into the BLAST databases. So you don't need to build blastdb for specific taxids now.
Changes:
- 2018-09-13 rewritten
- 2018-12-22 providing faster method for step 3.1
- 2019-01-07 add note of new blastdb version
- 2020-10-14 update steps for huge number of accessions belong to high taxon level like bacteria.
Data:
- pre-formated blastdb (09/10/2018)
- prot.accession2taxid.gz (09/07/2018) (optional, but recommended)
Hardware in this tutorial
- CPU: AMD 8-cores/16-threads 3.7Ghz
- RAM: 64GB
- DISK:
- Taxonomy files stores in NVMe SSD
- blastdb files stores in 7200rpm HDD
Tools:
- blast+
- pigz (recommended, faster than gzip)
- taxonkit
- seqkit (recommended), version >= 0.14.0
- rush (optional, for parallizing filtering sequence)
Steps:
- Listing all taxids below
$id
using taxonkit.
id=6656
# 6656 is the phylum Arthropoda
# echo 6656 | taxonkit lineage | taxonkit reformat
# 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;;
# 2 bacteria
# 2157 archaea
# 4751 fungi
# 10239 virus
# time: 2s
taxonkit list --ids $id --indent "" > $id.taxid.txt
# taxonkit list --ids 2,4751,10239 --indent "" > microbe.taxid.txt
wc -l $id.taxid.txt
# 518373 6656.taxid.txt
-
Retrieving target accessions. There are two options:
- From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in
nr
.
- From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in
# time: 4min
pigz -dc prot.accession2taxid.gz \
| csvtk grep -t -f taxid -P $id.taxid.txt \
| csvtk cut -t -f accession.version,taxid \
| sed 1d \
> $id.acc2taxid.txt
cut -f 1 $id.acc2taxid.txt > $id.acc.txt
wc -l $id.acc.txt
# 8174609 6656.acc.txt
2. From pre-formated `nr` blastdb
# time: 40min
blastdbcmd -db nr -entry all -outfmt "%a %T" | pigz -c > nr.acc2taxid.txt.gz
pigz -dc nr.acc2taxid.txt.gz | wc -l
# 555220892
# time: 3min
pigz -dc nr.acc2taxid.txt.gz \
| csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \
| cut -d ' ' -f 1 \
> $id.acc.txt
wc -l $id.acc.txt
# 6928021 6656.acc.txt
-
Retrieving FASTA sequences from pre-formated blastdb. There are two options:
- From
nr.fa
exported from pre-formated blastdb (faster, smaller output file, recommended). DO NOT directly downloadnr.gz
from ncbi ftp, in which the FASTA headers are not well formated.
- From
# 1\. exporting nr.fa from pre-formated blastdb
# time: 117min (run only once)
blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz
# =====================================================================
# 2\. filtering sequence belong to $taxid
# ---------------------------------------------------------------------
# methond 1) (for cases where $id.acc.txt is not very huge)
# time: 80min
# perl one-liner is used to unfold records having mulitple accessions
time cat <(echo) <(pigz -dc nr.fa.gz) \
| perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//; $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
| seqkit grep -f $id.acc.txt -o nr.$id.fa.gz
# ---------------------------------------------------------------------
# method 2) (**faster**)
# 33min (run only once)
# (1). split nr.fa.gz. # Note: I have 16 cpus.
$ time seqkit split2 -p 15 nr.fa.gz
# (2). parallize unfolding
$ cat _unfold_blastdb_fa.sh
#!/bin/sh
perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//; $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } '
# 10 min
time ls nr.fa.gz.split/nr.part_*.fa.gz \
| rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \
| ./_unfold_blastdb_fa.sh \
| seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\.(.+)$} '
# (3). merge result
cat nr.$id.part*.fa.gz > nr.$id.fa.gz
rm nr.$id.part*.fa.gz
# ---------------------------------------------------------------------
# method 3) (for huge $id.acc.txt file, e.g., bacteria)
# (1). split ${id}.acc.txt into several parts. chunk size depends on lines and RAM (64G for me).
split -d -l 300000000 $id.acc.txt $id.acc.txt.part_
# (2). filter
time ls $id.acc.txt.part_* \
| rush -j 1 --immediate-output -v id=$id \
'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \
| ./_unfold_blastdb_fa.sh \
| seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz '
# (3). merge
cat nr.$id.part*.fa.gz > nr.$id.fa.gz
# clean
rm nr.$id.part*.fa.gz
rm $id.acc.txt.part_
# (4). optionally adding taxid, you may edit replacement (-r) below
# split
time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_
ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz
i=0
for f in $id.acc2taxid.txt.part_* ; do
echo $f
time pigz -cd nr.$id.with-taxid.part$i.fa.gz \
| seqkit replace -k $f -p "^([^\-]+?) " -r "{kv}-\$1 " -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz;
/bin/rm nr.$id.with-taxid.part$i.fa.gz
i=$(($i+1));
done
mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz
# =====================================================================
# 3\. counting sequences
#
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 902M 9月 13 01:42 nr.6656.fa.gz
#
pigz -dc nr.$id.fa.gz | grep '^>' -c
# 6928017
# Here 6928017 ~= 6928021 ($id.acc.txt)
2. Directly from pre-formated blastdb
# time: 5h20min
blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz
# counting sequences
#
# Note that the headers of outputed fasta by blastdbcmd are "folded"
# for accessions from different species with same sequences, so the
# number may be small than $(wc -l $id.acc.txt).
pigz -dc nr.$id.fa.gz | grep '^>' -c
# 1577383
# counting accessions
#
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 2.1G 9月 13 03:38 nr.6656.fa.gz
#
pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\n>/g' | grep '^>' -c
# 288415413
- makeblastdb
pigz -dc nr.$id.fa.gz > nr.$id.fa
# time: 3min ($nr.$id.fa from step 3 option 1)
#
# building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error:
#
# BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1
#
makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id
# rm nr.$id.fa
- blastp (optional)
# blastdb nr.$id is built from sequences in step 3 option 1
#
blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast
# real 0m20.866s
# $ cat t4.fa.blast | grep Query= -A 10
# Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a
#
# Length=35
Score E
# Sequences producing significant alignments: (Bits) Value
# 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17
# A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17
# ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15
# D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15
# ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15
4. Summaries of taxonomy data
You can change the TaxId of interest.
- Rank counts of common categories.
$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \
| rush -D ' ' -T b \
'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \
| sed 1d \
| taxonkit filter -i 2 -E genus -L genus \
| taxonkit lineage -L -r \
| csvtk freq -H -t -f 2 -nr \
> stats.{}.tsv '
$ csvtk -t join --outer-join stats.*.tsv \
| csvtk add-header -t -n "rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )" \
| csvtk csv2md -t
[Similar data on NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&uncultured=hide&unspecified=hide)
| rank | Archaea | Bacteria | Eukaryota | Fungi | Metazoa | Viridiplantae |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| species | 12482 | 460940 | 1349648 | 156908 | 957297 | 191026 |
| strain | 354 | 40643 | 3486 | 2352 | 33 | 50 |
| genus | 205 | 4112 | 90882 | 6844 | 64148 | 16202 |
| isolate | 7 | 503 | 809 | 76 | 17 | 3 |
| species group | 2 | 77 | 251 | 22 | 214 | 5 |
| serotype | | 218 | | | | |
| serogroup | | 136 | | | | |
| subsection | | | 21 | | | 21 |
| subspecies | | 632 | 24523 | 158 | 17043 | 7212 |
| forma specialis | | 521 | 220 | 179 | 33 | 1 |
| species subgroup | | 23 | 101 | | 101 | |
| biotype | | 7 | 10 | | | |
| morph | | | 12 | 3 | 4 | 5 |
| section | | | 437 | 37 | 2 | 398 |
| genotype | | | 12 | | | 12 |
| series | | | 9 | | 5 | 4 |
| varietas | | 25 | 8499 | 1100 | 2 | 7188 |
| forma | | 4 | 560 | 185 | 6 | 315 |
| subgenus | | 1 | 1558 | 10 | 1414 | 112 |
| pathogroup | | 5 | | | | |
| subvariety | | | 5 | | | 5 |
- Count of all ranks
$ time taxonkit list --ids 1 \
| taxonkit lineage -L -r \
| csvtk freq -H -t -f 2 -nr \
| csvtk pretty -H -t
species 1879659
no rank 222743
genus 96625
strain 44483
subspecies 25174
family 9492
varietas 8524
subfamily 3050
tribe 2213
order 1660
subgenus 1618
isolate 1319
serotype 1216
clade 886
superfamily 865
forma specialis 741
forma 564
subtribe 508
section 437
class 429
suborder 372
species group 330
phylum 272
subclass 156
serogroup 138
infraorder 130
species subgroup 124
superorder 55
subphylum 33
parvorder 26
subsection 21
genotype 20
infraclass 18
biotype 17
morph 12
kingdom 11
series 9
superclass 6
cohort 5
pathogroup 5
subvariety 5
superkingdom 4
subcohort 3
subkingdom 1
superphylum 1
real 0m3.663s
user 0m15.897s
sys 0m1.010s
- Ranks of taxa at or below species.
$ taxonkit list --ids 1 \
| taxonkit filter --lower-than species --equal-to species \
| taxonkit lineage -L -r \
| csvtk freq -Ht -nr -f 2 \
| csvtk add-header -t -n rank,count \
| csvtk pretty -t
rank count
--------------- -------
species 1880044
no rank 222756
strain 44483
subspecies 25171
varietas 8524
isolate 1319
serotype 1216
clade 885
forma specialis 741
forma 564
serogroup 138
genotype 20
biotype 17
morph 12
pathogroup 5
subvariety 5
https://bioinf.shenwei.me/taxonkit/tutorial/
每个命令更详细的使用参考:https://bioinf.shenwei.me/taxonkit/usage/
https://github.com/shenwei356/taxonkit/
https://blog.csdn.net/rainforestist/article/details/120838968?
https://blog.csdn.net/weixin_42355003/article/details/112660415?
https://blog.csdn.net/woodcorpse/article/details/113777951
//www.greatytc.com/p/2c1a19ade058
https://wap.sciencenet.cn/blog-3334560-1271157.html?mobile=1