TaxonKit | A Practical and Efficient NCBI Taxonomy Toolkit

一、NCBI Taxonomy 数据文件

NCBI Taxonomy数据库将所有生物的分类学关系组织为一棵“有根树”(rooted tree), 与进化树(Phylogenetic tree)不同: 进化树是按进化关系”组织,且可以为“无根树”(unrooted tree)。

NCBI Taxonomy公开数据格式有两种,旧的名称为 taxdump.tar.gz ,文件大小约50Mb,内含以下文件。

nodes.dmp        [当前版本] 节点信息
                 重要内容: tax_id, parent tax_id, rank
names.dmp        [当前版本] 名称信息
                 重要内容: tax_id, name_txt
merged.dmp       [目前为止] 被合并的节点信息
                 重要内容: old_tax_id, new_tax_id
delnodes.dmp     [目前为止] 被删除的nodes信息
                 重要内容: tax_id

citations.dmp     引用信息
division.dmp      division信息
gencode.dmp       遗传编码信息
gc.prt            遗传编码表
readme.txt        说明文档

其中最主要的是前4个文件:

  • nodes.dmp 主要包含当前版本的所有分类学单元节点(taxon) 的唯一标识符(taxonomic identifier, 简称TaxId, taxid, tax_id), 分类学水平(rank),及其父节点的TaxID。
  • names.dmp 主要包含包含当前版本的所有TaxID及 其统一科学名称(scientific name)和别名
  • merged.dmp 包含了到当前版本为止,所有被合并的TaxID与合并到的新TaxID。
  • delnodes.dmp 包含了到当前版本为止,所有被删除的TaxID。

在2018年2月的时候,推出了新的格式, 额外包含了谱系(lineage),类型(type)和宿主(host)信息。 文件名称为new_taxdump.tar.gz,文件大小约110Mb。 相对旧版,新版本文件数量和内容更多,主要是因为增加了lineage和类型信息。 事实上lineage是可以从nodes.dmpnames.dmp计算而来。 新版格式所含文件如下:

nodes.dmp
names.dmp
merged.dmp
delnodes.dmp

fullnamelineage.dmp
TaxIDlineage.dmp
rankedlineage.dmp

host.dmp
typeoftype.dmp
typematerial.dmp

citations.dmp
division.dmp
gencode.dmp
readme.txt

NCBI Taxonomy数据每天都在更新,每月初(大多为1号)的数据作为存档保存在 taxdump_archive/ 目录, 旧版本最早数据到2014年8月,新版本只到2018年12月。

二、TaxonKit

TaxonKit为命令行工具,采用子命令的方式来执行不同功能, 大多数子命令支持标准输入/输出,便于使用命令行管道进行流水作业, 轻松整合进分析流程中。

子命令 功能
list 列出指定TaxId下所有子单元的的TaxID
lineage 根据TaxID获取完整谱系(lineage)
reformat 将完整谱系转化为“界门纲目科属种株"的自定义格式
name2taxid 将分类单元名称转化为TaxID
filter 按分类学水平范围过滤TaxIDs
lca 计算最低公共祖先(LCA)
taxid-changelog 追踪TaxID变更记录
version 显示版本信息、检测新版本
genautocomplete 生成shell自动补全配置脚本
  • 输出
    • 所有命令输出中包含输入数据内容,在此基础上增加列。
    • 所有命令默认输出到标准输出(stdout),可通过重定向(>)写入文件。
    • 或通过全局参数-o--out-file指定输出文件,且可自动识别输出文件后缀(.gz)输出gzip格式。
  • 输入
    • 除了listtaxid-changelog之外,lineage, reformat, name2taxid, filterlca 均可从标准输入(stdin)读取输入数据,也可通过位置参数(positional arguments)输入,即命令后面不带 任何flag的参数,如 taxonkit lineage taxids.txt
    • 输入格式为单列,或者制表符分隔的格式,输入数据所在列用-i--taxid-field指定。

TaxonKit直接解析NCBI Taxonomy数据文件(2秒左右),配置更容易,也便于更新数据,占用内存在500Mb-1.5G左右。 数据下载:

# 有时下载失败,可多试几次;或尝试浏览器下载此链接
wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz 
tar -zxvf taxdump.tar.gz

# 解压文件存于家目录中.taxonkit/,程序默认数据库默认目录
mkdir -p $HOME/.taxonkit
cp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit

1. list 列出指定TaxId所在子树的所有TaxID

taxonkit list用于列出指定TaxID所在分类学单元(taxon)的子树(subtree)的所有taxon的TaxID,可选显示名称和分类学水平。 此功能与NCBI Taxonomy网页版类似。

如,以人属(9605) 和肠道中著名的Akk菌属(239934)为例

$ taxonkit list --show-rank --show-name --indent "    " --ids 9605,239934
9605 [genus] Homo
    9606 [species] Homo sapiens
        63221 [subspecies] Homo sapiens neanderthalensis
        741158 [subspecies] Homo sapiens subsp. 'Denisova'
    1425170 [species] Homo heidelbergensis
    2665952 [no rank] environmental samples
        2665953 [species] Homo sapiens environmental sample

239934 [genus] Akkermansia
    239935 [species] Akkermansia muciniphila
        349741 [strain] Akkermansia muciniphila ATCC BAA-835
    512293 [no rank] environmental samples
        512294 [species] uncultured Akkermansia sp.
        1131822 [species] uncultured Akkermansia sp. SMG25
        1262691 [species] Akkermansia sp. CAG:344
        1263034 [species] Akkermansia muciniphila CAG:154
    1679444 [species] Akkermansia glycaniphila
    2608915 [no rank] unclassified Akkermansia
        1131336 [species] Akkermansia sp. KLE1605
    ...

list使用最广泛的的功能是获取某个类别(比如细菌、病毒、某个属等)下所有的TaxID, 用来从NCBI nt/nr中获取对应的核酸/蛋白序列,从而搭建特异性的BLAST数据库。 官网提供了相应的详细步骤: http://bioinf.shenwei.me/taxonkit/tutorial

所有细菌的 TaxID
$ taxonkit list --show-rank --show-name --ids 2 > /dev/null

2. lineage 根据TaxID获取完整谱系

分类学数据相关最常见的功能就是根据TaxID获取完整谱系。 TaxonKit可根据输入文件提供的TaxID列表快速计算lineage,并可选提供名称,分类学水平,以及谱系对应的TaxID。

值得注意的是,随着Taxonomy数据的频繁更新,有的TaxID可能被删除、或合并(merge)到其它TaxID中, TaxonKit会自动识别,并进行提示,对于被合并的TaxID,TaxonKit会按新TaxID进行计算。

# 使用example中的测试数据
$ head taxids.txt
9606
9913
376619
# 查找指定taxids列表的物种信息,tee可输出屏幕并写入文件
$ taxonkit lineage taxids.txt | tee lineage.txt 
19:22:13.077 [WARN] taxid 92489 was merged into 796334
19:22:13.077 [WARN] taxid 1458427 was merged into 1458425
19:22:13.077 [WARN] taxid 123124124 not found
19:22:13.077 [WARN] taxid 3 was deleted
9606    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
9913    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
376619  cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS
349741  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935  cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101  cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B
11932   Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y
123124124
3
92489   cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae
1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei

与其它软件的性能相比,当查询数量较少时ETE较快,数量较多时则TaxonKit更快。 在不同数据量规模上 TaxonKit速度一直很稳定,均为2-3秒,时间主要花在解析Taxonomy数据文件上。

列出lineage每个分类学单元的的TaxId和rank和名称,比如SARS-COV-2。

# lineage提取SARS-COV-2的世系
$ echo "2697049" \
    | taxonkit lineage -t -R \
    | sed "s/\t/\n/g"
2697049
Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049
superkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank

3. reformat 生成标准层级物种注释

有时候,我们并不需要完整的分类学谱系(complete lineage),因为很多级别即不常用,而且不完整。通常只想保留界门纲目科属种。

值得注意的是,不是所有物种都有完整的界门纲目科属种水平,特别是病毒以及一些环境样品。 TaxonKit可以用自定义内容替代缺失的分类单元,如用“__”替代。 更厉害有用的是,TaxonKit还可以用更高层级的分类单元信息来补齐缺失的层级 (-F/--fill-miss-rank),比如

# 没有genus的病毒
$ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y

# -F参数会用family信息来补齐genus信息
$ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y

输出格式可选只输出部分分类学水平,还支持制表符("\t"),再配合作者的另一个工具csvtk,可以输出漂亮的结果。

其它有用的选项:

  • -P/--add-prefix:给每个分类学水平添加前缀,比如s__species
  • -t/--show-lineage-taxids:输出分类学单元对应的TaxID。
  • -r/--miss-rank-repl: 替代没有对应rank的taxon名称
  • -S/--pseudo-strain: 对于低于species且rank既不是subspecies也不是strain的taxid,使用水平最低taxon名称做为菌株名称。

例,

$ echo -ne "349741\n1327037"\
    | taxonkit lineage \
    | taxonkit reformat -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F -P \
    | csvtk cut -t -f -2 \
    | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \
    | csvtk pretty -t

taxid     kindom        phylum               class                 order                   family               genus                                species
349741    k__Bacteria   p__Verrucomicrobia   c__Verrucomicrobiae   o__Verrucomicrobiales   f__Akkermansiaceae   g__Akkermansia                       s__Akkermansia muciniphila
1327037   k__Viruses    p__Uroviricota       c__Caudoviricetes     o__Caudovirales         f__Siphoviridae      g__unclassified Siphoviridae genus   s__Croceibacter phage P2559Y



# 便于小屏幕查看,用csvtk进行转置
$ echo -ne "349741\n1327037"\
    | taxonkit lineage \
    | taxonkit reformat -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F -P \
    | csvtk cut -t -f -2 \
    | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \
    | csvtk transpose -t \
    | csvtk pretty -H -t

taxid     349741                       1327037
kindom    k__Bacteria                  k__Viruses
phylum    p__Verrucomicrobia           p__Uroviricota
class     c__Verrucomicrobiae          c__Caudoviricetes
order     o__Verrucomicrobiales        o__Caudovirales
family    f__Akkermansiaceae           f__Siphoviridae
genus     g__Akkermansia               g__unclassified Siphoviridae genus
species   s__Akkermansia muciniphila   s__Croceibacter phage P2559Y




# 到株水平,以sars-cov-2为例
$ echo -ne "2697049"\
    | taxonkit lineage \
    | taxonkit reformat -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" -F -P -S \
    | csvtk cut -t -f -2 \
    | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species,strain \
    | csvtk transpose -t \
    | csvtk pretty -H -t

taxid     2697049
kindom    k__Viruses
phylum    p__Pisuviricota
class     c__Pisoniviricetes
order     o__Nidovirales
family    f__Coronaviridae
genus     g__Betacoronavirus
species   s__Severe acute respiratory syndrome-related coronavirus
strain    t__Severe acute respiratory syndrome coronavirus 2

4. name2taxid 将分类单元名称转化为TaxID

将分类单元名称转化为TaxID非常容易理解,唯一要注意的是 某些TaxId对应相同的名称,比如

# -i指定列,-r显示级别,-L不显示世系
$ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L
Drosophila      7215    genus
Drosophila      32281   subgenus
Drosophila      2081351 genus` </pre>

获取TaxID之后,可以立即传给taxonkit进行后续操作,但要注意用-i指定TaxId所在列。

5. filter 按分类学水平范围过滤TaxIDs

filter可以按分类学水平范围过滤TaxIDs,注意,不仅仅是特定的Rank,而是一个范围。 比如genus及以下的分类学水平,用-L genus -E genus,类似于 <= genus

$ cat taxids2.txt \
    | taxonkit filter -L genus -E genus  \
    | taxonkit lineage -r -n -L \
    | csvtk -Ht cut -f 1,3,2 \
    | csvtk pretty -H -t
239934   genus     Akkermansia
239935   species   Akkermansia muciniphila
349741   strain    Akkermansia muciniphila ATCC BAA-835

6. lca 计算最低公共祖先(LCA)

比如人属的例子

$ taxonkit list --ids 9605 -nr --indent "    "    
9605 [genus] Homo
    9606 [species] Homo sapiens
        63221 [subspecies] Homo sapiens neanderthalensis
        741158 [subspecies] Homo sapiens subsp. 'Denisova'
    1425170 [species] Homo heidelbergensis
    2665952 [no rank] environmental samples
        2665953 [species] Homo sapiens environmental sample

TaxID的分隔符可用-s/--separater指定,默认为" "。

# 计算两个物种的最近共同祖先,以上面尼安德特人亚种和海德堡人种
$ echo 63221 2665953 | taxonkit lca
63221 2665953   9605

# 其它分隔符,且不小心多了空格
$ echo -ne "a\t63221,2665953\nb\t63221, 741158\n"
a       63221,2665953
b       63221, 741158

$ echo -ne "a\t63221,2665953\nb\t63221, 741158\n" \
    | taxonkit lca -i 2 -s ","
a       63221,2665953   9605
b       63221, 741158   9606

7. TaxID-changelog 追踪TaxID变更记录

NCBI Taxonomy数据每天都在更新,每月初(大多为1号)的数据作为存档保存在 taxdump_archive/ 目录, 旧版本最早数据到2014年8月,新版本只到2018年12月。

TaxonKit可以追踪所有TaxID每个月的变化,输出到csv文件中,可以通过命令行工具进行查询。 数据和文档单独托管在 https://github.com/shenwei356/taxid-changelog

除了简单的增加、删除、合并之外,作者将TaxID改变做了细分。输出格式如下

# 列            备注
taxid           # taxid
version         # version / time of archive, e.g, 2019-07-01
change          # change, values:
                #   NEW             新增
                #   REUSE_DEL       前期被删除,现在又重新加入
                #   REUSE_MER       前期被合并,现在又重新加入
                #   DELETE          删除
                #   MERGE           合并到另一个TaxID
                #   ABSORB          其他TaxID合并到当前TaxID
                #   CHANGE_NAME     名称改变
                #   CHANGE_RANK     分类学水平改变
                #   CHANGE_LIN_LIN  谱系的TaxID没有变化,谱系改变(某些TaxID的名称变了)
                #   CHANGE_LIN_TAX  谱系的TaxID改变
                #   CHANGE_LIN_LEN  谱系的长度/深度发生变化
change-value    # variable values for changes: 
                #   1) new taxid for MERGE
                #   2) merged taxids for ABSORB
                #   3) empty for others
name            # scientific name
rank            # rank
lineage         # complete lineage of the taxid
lineage-taxids  # taxids of the lineage

数据文件可以在前面网站上下载,taxid-changelog.csv.gz,130M左右,解压后2.2G,因为是gzip格式,完全不需要解压即可分析。 下文使用了pigz代替zcatgzip提高解压速度。

例1 superkingdom也能消失 ,比如类病毒(Viroids)在2019年5月被删除了。 作者是在某一天无意中发现此事,所以决定刨根问底,开发了这个子命令。

# 下载
wget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz
# 安装多线程解压索软件。或者用gzip替换。
conda install pigz

$ pigz -cd taxid-changelog.csv.gz \
    | csvtk grep -f rank -p superkingdom \
    | csvtk pretty 
taxid   version      change   change-value   name        rank           lineage                        lineage-taxids
2       2014-08-01   NEW                     Bacteria    superkingdom   cellular organisms;Bacteria    131567;2
2157    2014-08-01   NEW                     Archaea     superkingdom   cellular organisms;Archaea     131567;2157
2759    2014-08-01   NEW                     Eukaryota   superkingdom   cellular organisms;Eukaryota   131567;2759
10239   2014-08-01   NEW                     Viruses     superkingdom   Viruses                        10239
12884   2014-08-01   NEW                     Viroids     superkingdom   Viroids                        12884
12884   2019-05-01   DELETE                  Viroids     superkingdom   Viroids                        12884

例2 SARS-CoV-2 。可见新冠病毒在2020年2月加入,随后3月和6月份改了名称,谱系等信息。查询速度也很快。

# 本例子只显示了部分列。
$ time pigz -cd taxid-changelog.csv.gz \
    | csvtk grep -f taxid -p 2697049 \
    | csvtk cut -f version,change,name,rank \
    | csvtk pretty

version      change           name                                              rank
2020-02-01   NEW              Wuhan seafood market pneumonia virus              species
2020-03-01   CHANGE_NAME      Severe acute respiratory syndrome coronavirus 2   no rank
2020-03-01   CHANGE_RANK      Severe acute respiratory syndrome coronavirus 2   no rank
2020-03-01   CHANGE_LIN_LEN   Severe acute respiratory syndrome coronavirus 2   no rank
2020-06-01   CHANGE_LIN_LEN   Severe acute respiratory syndrome coronavirus 2   no rank
2020-07-01   CHANGE_RANK      Severe acute respiratory syndrome coronavirus 2   isolate
2020-08-01   CHANGE_RANK      Severe acute respiratory syndrome coronavirus 2   no rank

real    0m7.644s
user    0m16.749s
sys     0m3.985s

三、Tutorial

Table of Contents

1.Formatting lineage

$ echo "2697049" \
    | taxonkit lineage -t \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht

10239     superkingdom   Viruses
2559587   clade          Riboviria
2732396   kingdom        Orthornavirae
2732408   phylum         Pisuviricota
2732506   class          Pisoniviricetes
76804     order          Nidovirales
2499399   suborder       Cornidovirineae
11118     family         Coronaviridae
2501931   subfamily      Orthocoronavirinae
694002    genus          Betacoronavirus
2509511   subgenus       Sarbecovirus
694009    species        Severe acute respiratory syndrome-related coronavirus
2697049   no rank        Severe acute respiratory syndrome coronavirus 2

Example data.

$ cat taxids3.txt
376619
349741
239935
314101
11932
1327037
83333
1408252
2605619
2697049

Format to 7-level ranks ("superkingdom phylum class order family genus species").

$ cat taxids3.txt \
    | taxonkit reformat -I 1

376619  Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis
349741  Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
239935  Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101  Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B
11932   Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y
83333   Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus

Format to 8-level ranks ("superkingdom phylum class order family genus species subspecies/rank").

$ cat taxids3.txt \
    | taxonkit reformat -I 1 -f "{k};{p};{c};{o};{f};{g};{s};{t}"

376619  Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS
349741  Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935  Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;
314101  Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B;
11932   Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle;
1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y;
83333   Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12
1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178
2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus;

Replace missing ranks with Unassigned and output tab-delimited format.

$ cat taxids3.txt \
    | taxonkit reformat -I 1 -r "Unassigned" -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    | csvtk pretty -H -t

376619    Bacteria   Proteobacteria    Gammaproteobacteria   Thiotrichales        Francisellaceae      Francisella                  Francisella tularensis                                  Francisella tularensis subsp. holarctica LVS
349741    Bacteria   Verrucomicrobia   Verrucomicrobiae      Verrucomicrobiales   Akkermansiaceae      Akkermansia                  Akkermansia muciniphila                                 Akkermansia muciniphila ATCC BAA-835
239935    Bacteria   Verrucomicrobia   Verrucomicrobiae      Verrucomicrobiales   Akkermansiaceae      Akkermansia                  Akkermansia muciniphila                                 Unassigned
314101    Bacteria   Unassigned        Unassigned            Unassigned           Unassigned           Unassigned                   uncultured murine large bowel bacterium BAC 54B         Unassigned
11932     Viruses    Artverviricota    Revtraviricetes       Ortervirales         Retroviridae         Intracisternal A-particles   Mouse Intracisternal A-particle                         Unassigned
1327037   Viruses    Uroviricota       Caudoviricetes        Caudovirales         Siphoviridae         Unassigned                   Croceibacter phage P2559Y                               Unassigned
83333     Bacteria   Proteobacteria    Gammaproteobacteria   Enterobacterales     Enterobacteriaceae   Escherichia                  Escherichia coli                                        Escherichia coli K-12
1408252   Bacteria   Proteobacteria    Gammaproteobacteria   Enterobacterales     Enterobacteriaceae   Escherichia                  Escherichia coli                                        Escherichia coli R178
2605619   Bacteria   Proteobacteria    Gammaproteobacteria   Enterobacterales     Enterobacteriaceae   Escherichia                  Escherichia coli                                        Unassigned
2697049   Viruses    Pisuviricota      Pisoniviricetes       Nidovirales          Coronaviridae        Betacoronavirus              Severe acute respiratory syndrome-related coronavirus   Unassigned

Fill missing ranks and add prefixes.

$ cat taxids3.txt \
    | taxonkit reformat -I 1 -F -P -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    | csvtk pretty -H -t

376619    k__Bacteria   p__Proteobacteria                 c__Gammaproteobacteria           o__Thiotrichales                 f__Francisellaceae                g__Francisella                       s__Francisella tularensis                                  t__Francisella tularensis subsp. holarctica LVS
349741    k__Bacteria   p__Verrucomicrobia                c__Verrucomicrobiae              o__Verrucomicrobiales            f__Akkermansiaceae                g__Akkermansia                       s__Akkermansia muciniphila                                 t__Akkermansia muciniphila ATCC BAA-835
239935    k__Bacteria   p__Verrucomicrobia                c__Verrucomicrobiae              o__Verrucomicrobiales            f__Akkermansiaceae                g__Akkermansia                       s__Akkermansia muciniphila                                 t__unclassified Akkermansia muciniphila subspecies/strain
314101    k__Bacteria   p__unclassified Bacteria phylum   c__unclassified Bacteria class   o__unclassified Bacteria order   f__unclassified Bacteria family   g__unclassified Bacteria genus       s__uncultured murine large bowel bacterium BAC 54B         t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932     k__Viruses    p__Artverviricota                 c__Revtraviricetes               o__Ortervirales                  f__Retroviridae                   g__Intracisternal A-particles        s__Mouse Intracisternal A-particle                         t__unclassified Mouse Intracisternal A-particle subspecies/strain
1327037   k__Viruses    p__Uroviricota                    c__Caudoviricetes                o__Caudovirales                  f__Siphoviridae                   g__unclassified Siphoviridae genus   s__Croceibacter phage P2559Y                               t__unclassified Croceibacter phage P2559Y subspecies/strain
83333     k__Bacteria   p__Proteobacteria                 c__Gammaproteobacteria           o__Enterobacterales              f__Enterobacteriaceae             g__Escherichia                       s__Escherichia coli                                        t__Escherichia coli K-12
1408252   k__Bacteria   p__Proteobacteria                 c__Gammaproteobacteria           o__Enterobacterales              f__Enterobacteriaceae             g__Escherichia                       s__Escherichia coli                                        t__Escherichia coli R178
2605619   k__Bacteria   p__Proteobacteria                 c__Gammaproteobacteria           o__Enterobacterales              f__Enterobacteriaceae             g__Escherichia                       s__Escherichia coli                                        t__unclassified Escherichia coli subspecies/strain
2697049   k__Viruses    p__Pisuviricota                   c__Pisoniviricetes               o__Nidovirales                   f__Coronaviridae                  g__Betacoronavirus                   s__Severe acute respiratory syndrome-related coronavirus   t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain

When these's no nodes of rank "subspecies" nor "strain", we can switch -S/--pseudo-strain to use the node with lowest rank as subspecies/strain name, if which rank is lower than "species".

$ cat taxids3.txt \
    | taxonkit lineage -r -L \
    | taxonkit reformat -I 1 -F -S -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    | cut -f 1,2,9,10 \
    | csvtk add-header -t -n "taxid,rank,species,strain" \
    | csvtk pretty -t

taxid     rank         species                                                 strain
-------   ----------   -----------------------------------------------------   ------------------------------------------------------------------------------
376619    strain       Francisella tularensis                                  Francisella tularensis subsp. holarctica LVS
349741    strain       Akkermansia muciniphila                                 Akkermansia muciniphila ATCC BAA-835
239935    species      Akkermansia muciniphila                                 unclassified Akkermansia muciniphila subspecies/strain
314101    species      uncultured murine large bowel bacterium BAC 54B         unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932     species      Mouse Intracisternal A-particle                         unclassified Mouse Intracisternal A-particle subspecies/strain
1327037   species      Croceibacter phage P2559Y                               unclassified Croceibacter phage P2559Y subspecies/strain
83333     strain       Escherichia coli                                        Escherichia coli K-12
1408252   subspecies   Escherichia coli                                        Escherichia coli R178
2605619   no rank      Escherichia coli                                        Escherichia coli O16:H48
2697049   no rank      Severe acute respiratory syndrome-related coronavirus   Severe acute respiratory syndrome coronavirus 2

List eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with "no rank". But when filtering with -L/--lower-than, you can use -n/--save-predictable-norank to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff.

$ time taxonkit list --ids 1 \
    | taxonkit filter -L species -E species -R -N -n \
    | taxonkit lineage -n -r -L \
    | taxonkit reformat -I 1 -F -S -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    | csvtk cut -Ht -l -f 1,3,2,1,4-11 \
    | csvtk add-header -t -n "taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain" \
    | pigz -c > result.tsv.gz

real    0m25.167s
user    2m14.809s
sys     0m7.197s

$ pigz -cd result.tsv.gz \
    | csvtk grep -t -f taxid -p 2697049 \
    | csvtk transpose -t \
    | csvtk pretty -H -t

taxid     2697049
rank      no rank
name      Severe acute respiratory syndrome coronavirus 2
lineage   Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2
kingdom   Viruses
phylum    Pisuviricota
class     Pisoniviricetes
order     Nidovirales
family    Coronaviridae
genus     Betacoronavirus
species   Severe acute respiratory syndrome-related coronavirus
strain    Severe acute respiratory syndrome coronavirus 2

2.Parsing kraken/bracken result

Example Data

Run Kraken2 and Bracken

KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf
THREADS=16

CLASSIFICATION_LVL=S
THRESHOLD=10

READ_LEN=100
SAMPLE=SRS014459-Stool.fasta.gz

BRACKEN_OUTPUT_FILE=$SAMPLE

kraken2 --db ${KRAKEN_DB} --threads ${THREADS} -report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken

est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \
    -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken

Orignial format

$ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport
100.00  9491    0       R       1       root
99.85   9477    0       R1      131567    cellular organisms
99.85   9477    0       D       2           Bacteria
66.08   6271    0       D1      1783270       FCB group
66.08   6271    0       D2      68336           Bacteroidetes/Chlorobi group
66.08   6271    0       P       976               Bacteroidetes
66.08   6271    0       C       200643              Bacteroidia
66.08   6271    0       O       171549                Bacteroidales
34.45   3270    0       F       815                     Bacteroidaceae
34.45   3270    0       G       816                       Bacteroides
10.43   990     990     S       246787                      Bacteroides cellulosilyticus
7.98    757     757     S       28116                       Bacteroides ovatus
3.10    293     0       G1      2646097                     unclassified Bacteroides
1.06    100     100     S       2755405                       Bacteroides sp. CACC 737
0.49    46      46      S       2650157                       Bacteroides sp. HF-5287

Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py)

$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 5,1 \
    | taxonkit lineage \
    | taxonkit reformat -i 3 -P -f "{k}|{p}|{c}|{o}|{f}|{g}|{s}" \
    | csvtk cut -Ht -f 4,2 \
    | csvtk replace -Ht -p "(\|[kpcofgs]__)+$" \
    | csvtk replace -Ht -p "\|[kpcofgs]__\|" -r "|" \
    | csvtk uniq -Ht \
    | csvtk grep -Ht -p k__ -v \
    > SRS014459-Stool.fasta.gz_bracken_species.kreport.format

$ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format

k__Bacteria     99.85
k__Bacteria|p__Bacteroidetes    66.08
k__Bacteria|p__Bacteroidetes|c__Bacteroidia     66.08
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales    66.08
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae  34.45
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides   34.45
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus   10.43
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus     7.98
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737       1.06
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. HF-5287        0.49

Converting to Qiime format

$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 5,1 \
    | taxonkit lineage \
    | taxonkit reformat -i 3 -P -f "{k}; {p}; {c}; {o}; {f}; {g}; {s}" \
    | csvtk cut -Ht -f 4,2 \
    | csvtk replace -Ht -p "(; [kpcofgs]__)+$" \
    | csvtk replace -Ht -p "; [kpcofgs]__; " -r "; " \
    | csvtk uniq -Ht \
    | csvtk grep -Ht -p k__ -v \
    | head -n 10

k__Bacteria     99.85
k__Bacteria; p__Bacteroidetes   66.08
k__Bacteria; p__Bacteroidetes; c__Bacteroidia   66.08
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae      34.45
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides      34.45
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus     10.43
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus       7.98
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06
k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287  0.49` </pre>

Save taxon proportion and taxid, and get lineage, name and rank.

<pre id="__code_13" style="box-sizing: inherit; color: var(--md-code-fg-color); font-feature-settings: &quot;kern&quot;; font-family: var(--md-code-font-family); margin-bottom: 1em; margin-top: 1em; direction: ltr; display: flow-root; line-height: 1.4; position: relative;">`$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 1,5 \
    | taxonkit lineage -i 2 -n -r \
    | csvtk cut -Ht -f 1,2,5,4,3 \
    | head -n 10 \
    | csvtk pretty -Ht

100.00   1         no rank        root                           root
99.85    131567    no rank        cellular organisms             cellular organisms
99.85    2         superkingdom   Bacteria                       cellular organisms;Bacteria
66.08    1783270   clade          FCB group                      cellular organisms;Bacteria;FCB group
66.08    68336     clade          Bacteroidetes/Chlorobi group   cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group
66.08    976       phylum         Bacteroidetes                  cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes
66.08    200643    class          Bacteroidia                    cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia
66.08    171549    order          Bacteroidales                  cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales
34.45    815       family         Bacteroidaceae                 cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae
34.45    816       genus          Bacteroides                    cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides

Only save species or lower level and get lineage in format of "superkingdom phylum class order family genus species".

$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 1,5 \
    | taxonkit filter -N -E species -L species -i 2 \
    | taxonkit lineage -i 2 -n -r \
    | taxonkit reformat -i 3 -f "{k};{p};{c};{o};{f};{g};{s}" \
    | csvtk cut -Ht -f 1,2,5,4,6 \
    | csvtk add-header -t -n abundance,taxid,rank,name,lineage \
    | head -n 10 \
    | csvtk pretty -t

abundance   taxid     rank      name                           lineage
---------   -------   -------   ----------------------------   --------------------------------------------------------------------------------------------------------
10.43       246787    species   Bacteroides cellulosilyticus   Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus
7.98        28116     species   Bacteroides ovatus             Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus
1.06        2755405   species   Bacteroides sp. CACC 737       Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737
0.49        2650157   species   Bacteroides sp. HF-5287        Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287
0.99        2528203   species   Bacteroides sp. A1C1           Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1
0.28        2763022   species   Bacteroides sp. M10            Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10
0.16        2650158   species   Bacteroides sp. HF-5141        Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141
0.12        2715212   species   Bacteroides sp. CBA7301        Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CBA7301
5.10        817       species   Bacteroides fragilis           Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis

3.Making nr blastdb for specific taxids

Attention:

Changes:

  • 2018-09-13 rewritten
  • 2018-12-22 providing faster method for step 3.1
  • 2019-01-07 add note of new blastdb version
  • 2020-10-14 update steps for huge number of accessions belong to high taxon level like bacteria.

Data:

Hardware in this tutorial

  • CPU: AMD 8-cores/16-threads 3.7Ghz
  • RAM: 64GB
  • DISK:
    • Taxonomy files stores in NVMe SSD
    • blastdb files stores in 7200rpm HDD

Tools:

  • blast+
  • pigz (recommended, faster than gzip)
  • taxonkit
  • seqkit (recommended), version >= 0.14.0
  • rush (optional, for parallizing filtering sequence)

Steps:

  1. Listing all taxids below $id using taxonkit.
id=6656

    # 6656 is the phylum Arthropoda
    # echo 6656 | taxonkit lineage | taxonkit reformat
    # 6656    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda    Eukaryota;Arthropoda;;;;;

    # 2     bacteria
    # 2157  archaea
    # 4751  fungi
    # 10239 virus

    # time: 2s
    taxonkit list --ids $id --indent "" > $id.taxid.txt

    # taxonkit list --ids 2,4751,10239 --indent "" > microbe.taxid.txt

    wc -l $id.taxid.txt
    # 518373 6656.taxid.txt
  1. Retrieving target accessions. There are two options:

    1. From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr.
# time: 4min
        pigz -dc prot.accession2taxid.gz \
            | csvtk grep -t -f taxid -P $id.taxid.txt \
            | csvtk cut -t -f accession.version,taxid \
            | sed 1d \
            > $id.acc2taxid.txt

        cut -f 1 $id.acc2taxid.txt > $id.acc.txt

        wc -l $id.acc.txt
        # 8174609 6656.acc.txt
2.  From pre-formated `nr` blastdb
# time: 40min
        blastdbcmd -db nr -entry all -outfmt "%a %T" | pigz -c > nr.acc2taxid.txt.gz

        pigz -dc nr.acc2taxid.txt.gz | wc -l
        # 555220892

        # time: 3min
        pigz -dc nr.acc2taxid.txt.gz \
            | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \
            | cut -d ' '  -f 1 \
            > $id.acc.txt

        wc -l $id.acc.txt
        # 6928021 6656.acc.txt
  1. Retrieving FASTA sequences from pre-formated blastdb. There are two options:

    1. From nr.fa exported from pre-formated blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz from ncbi ftp, in which the FASTA headers are not well formated.
# 1\. exporting nr.fa from pre-formated blastdb

        # time: 117min (run only once)
        blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz

        # =====================================================================

        # 2\. filtering sequence belong to $taxid

        # ---------------------------------------------------------------------

        # methond 1) (for cases where $id.acc.txt is not very huge)
        # time: 80min
        # perl one-liner is used to unfold records having mulitple accessions
        time cat <(echo) <(pigz -dc nr.fa.gz) \
            | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
            | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz

        # ---------------------------------------------------------------------

        # method 2) (**faster**)

        # 33min (run only once)
        # (1). split nr.fa.gz. # Note: I have 16 cpus.
        $ time seqkit split2 -p 15 nr.fa.gz

        # (2). parallize unfolding
        $ cat _unfold_blastdb_fa.sh
        #!/bin/sh
        perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } '

        # 10 min
        time ls nr.fa.gz.split/nr.part_*.fa.gz \
            | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \
                | ./_unfold_blastdb_fa.sh \
                | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\.(.+)$} '

        # (3). merge result
        cat nr.$id.part*.fa.gz > nr.$id.fa.gz
        rm nr.$id.part*.fa.gz

        # ---------------------------------------------------------------------

        # method 3) (for huge $id.acc.txt file, e.g., bacteria)

        # (1). split ${id}.acc.txt into several parts. chunk size depends on lines and RAM (64G for me).
        split -d -l 300000000 $id.acc.txt $id.acc.txt.part_

        # (2). filter
        time ls $id.acc.txt.part_* \
            | rush -j 1 --immediate-output -v id=$id \
                'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \
                | ./_unfold_blastdb_fa.sh \
                | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz '

        # (3). merge
        cat nr.$id.part*.fa.gz > nr.$id.fa.gz

        # clean
        rm nr.$id.part*.fa.gz
        rm $id.acc.txt.part_

        # (4). optionally adding taxid, you may edit replacement (-r) below
        # split
        time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_

        ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz         
        i=0
        for f in $id.acc2taxid.txt.part_* ; do
            echo $f
            time pigz -cd nr.$id.with-taxid.part$i.fa.gz \
                | seqkit replace -k $f -p "^([^\-]+?) " -r "{kv}-\$1 " -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz;
            /bin/rm nr.$id.with-taxid.part$i.fa.gz
            i=$(($i+1));
        done
        mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz

        # =====================================================================

        # 3\. counting sequences
        #
        # ls -lh nr.$id.fa.gz
        # -rw-r--r-- 1 shenwei shenwei 902M 9月  13 01:42 nr.6656.fa.gz
        #
        pigz -dc nr.$id.fa.gz | grep '^>' -c

        # 6928017
        # Here 6928017 ~=  6928021 ($id.acc.txt)
2.  Directly from pre-formated blastdb
# time: 5h20min
        blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz

        # counting sequences
        #
        # Note that the headers of outputed fasta by blastdbcmd are "folded"
        # for accessions from different species with same sequences, so the
        # number may be small than $(wc -l $id.acc.txt).
        pigz -dc nr.$id.fa.gz | grep '^>' -c
        # 1577383

        # counting accessions
        #
        # ls -lh nr.$id.fa.gz
        # -rw-r--r-- 1 shenwei shenwei 2.1G 9月  13 03:38 nr.6656.fa.gz
        #
        pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\n>/g' | grep '^>' -c
        # 288415413
  1. makeblastdb
pigz -dc nr.$id.fa.gz > nr.$id.fa

    # time: 3min ($nr.$id.fa from step 3 option 1)
    #
    # building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error:
    #
    #     BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1
    #
    makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id

    # rm nr.$id.fa
  1. blastp (optional)
# blastdb nr.$id is built from sequences in step 3 option 1
    #
    blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast
    # real    0m20.866s

    # $ cat t4.fa.blast | grep Query= -A 10
    # Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a
    #
    # Length=35
                                                                         Score     E
    # Sequences producing significant alignments:                          (Bits)  Value

    # 2MPQ_A  Chain A, Solution structure of the sodium channel toxin Hd1a  72.4    2e-17
    # A0A0J9X1W9.2  RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-...  72.4    2e-17
    # ADB56726.1  HNTX-IV.2 precursor [Haplopelma hainanum]                 66.6    9e-15
    # D2Y233.1  RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H...  66.6    9e-15
    # ADB56830.1  HNTX-IV.3 precursor [Haplopelma hainanum]                 66.6    9e-15

4. Summaries of taxonomy data

You can change the TaxId of interest.

  1. Rank counts of common categories.
$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \
        | rush -D ' ' -T b \
            'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \
                | sed 1d \
                | taxonkit filter -i 2 -E genus -L genus \
                | taxonkit lineage -L -r \
                | csvtk freq -H -t -f 2 -nr \
                > stats.{}.tsv '

    $ csvtk -t join --outer-join stats.*.tsv \
        | csvtk add-header -t -n "rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )" \
        | csvtk csv2md -t
[Similar data on NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=statistics&uncultured=hide&unspecified=hide)

| rank | Archaea | Bacteria | Eukaryota | Fungi | Metazoa | Viridiplantae |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| species | 12482 | 460940 | 1349648 | 156908 | 957297 | 191026 |
| strain | 354 | 40643 | 3486 | 2352 | 33 | 50 |
| genus | 205 | 4112 | 90882 | 6844 | 64148 | 16202 |
| isolate | 7 | 503 | 809 | 76 | 17 | 3 |
| species group | 2 | 77 | 251 | 22 | 214 | 5 |
| serotype |  | 218 |  |  |  |  |
| serogroup |  | 136 |  |  |  |  |
| subsection |  |  | 21 |  |  | 21 |
| subspecies |  | 632 | 24523 | 158 | 17043 | 7212 |
| forma specialis |  | 521 | 220 | 179 | 33 | 1 |
| species subgroup |  | 23 | 101 |  | 101 |  |
| biotype |  | 7 | 10 |  |  |  |
| morph |  |  | 12 | 3 | 4 | 5 |
| section |  |  | 437 | 37 | 2 | 398 |
| genotype |  |  | 12 |  |  | 12 |
| series |  |  | 9 |  | 5 | 4 |
| varietas |  | 25 | 8499 | 1100 | 2 | 7188 |
| forma |  | 4 | 560 | 185 | 6 | 315 |
| subgenus |  | 1 | 1558 | 10 | 1414 | 112 |
| pathogroup |  | 5 |  |  |  |  |
| subvariety |  |  | 5 |  |  | 5 |
  1. Count of all ranks
$ time taxonkit list --ids 1 \
        | taxonkit lineage -L -r \
        | csvtk freq -H -t -f 2 -nr \
        | csvtk pretty -H -t

    species            1879659
    no rank            222743
    genus              96625
    strain             44483
    subspecies         25174
    family             9492
    varietas           8524
    subfamily          3050
    tribe              2213
    order              1660
    subgenus           1618
    isolate            1319
    serotype           1216
    clade              886
    superfamily        865
    forma specialis    741
    forma              564
    subtribe           508
    section            437
    class              429
    suborder           372
    species group      330
    phylum             272
    subclass           156
    serogroup          138
    infraorder         130
    species subgroup   124
    superorder         55
    subphylum          33
    parvorder          26
    subsection         21
    genotype           20
    infraclass         18
    biotype            17
    morph              12
    kingdom            11
    series             9
    superclass         6
    cohort             5
    pathogroup         5
    subvariety         5
    superkingdom       4
    subcohort          3
    subkingdom         1
    superphylum        1

    real    0m3.663s
    user    0m15.897s
    sys     0m1.010s
  1. Ranks of taxa at or below species.
$ taxonkit list --ids 1 \
        | taxonkit filter --lower-than species --equal-to species \
        | taxonkit lineage -L -r  \
        | csvtk freq -Ht -nr -f 2 \
        | csvtk add-header -t -n rank,count \
        | csvtk pretty -t

    rank              count
    ---------------   -------
    species           1880044
    no rank           222756
    strain            44483
    subspecies        25171
    varietas          8524
    isolate           1319
    serotype          1216
    clade             885
    forma specialis   741
    forma             564
    serogroup         138
    genotype          20
    biotype           17
    morph             12
    pathogroup        5
    subvariety        5

https://bioinf.shenwei.me/taxonkit/tutorial/

每个命令更详细的使用参考:https://bioinf.shenwei.me/taxonkit/usage/

https://github.com/shenwei356/taxonkit/
https://blog.csdn.net/rainforestist/article/details/120838968?
https://blog.csdn.net/weixin_42355003/article/details/112660415?
https://blog.csdn.net/woodcorpse/article/details/113777951
//www.greatytc.com/p/2c1a19ade058
https://wap.sciencenet.cn/blog-3334560-1271157.html?mobile=1

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 214,922评论 6 497
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,591评论 3 389
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 160,546评论 0 350
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,467评论 1 288
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,553评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,580评论 1 293
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,588评论 3 414
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,334评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,780评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,092评论 2 330
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,270评论 1 344
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,925评论 5 338
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,573评论 3 322
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,194评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,437评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,154评论 2 366
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,127评论 2 352

推荐阅读更多精彩内容