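Gepard can quickly run a collinearity analysis on two FASTA sequences and produce a two-dimensional dot plot. I use it to determine the orientation of the SSC region in chloroplast genomes assembled by GetOrganelle. Although the software was released back in 2007 [1], it is more than adequate for this purpose.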
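I had already tried the Windows version (see [3]), but when a large number of assembled chloroplast SSC regions need to be checked, batch processing becomes necessary. Gepard can be installed on Linux through conda or Docker, which is quite convenient; see https://github.com/univieCUBE/gepard.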
After installation, type Gepardcmd
and the help message is printed:
Gepard 2.0 - command line mode
Reference:
Krumsiek J, Arnold R, Rattei T
Gepard: A rapid and sensitive tool for creating dotplots on genome scale.
Bioinformatics 2007; 23(8): 1026-8. PMID: 17309896
Parameters are supplied as -name value
Required parameters:
-seq: the sequences, seperated by spaces. The first gets paired to the second, third to fourth and so on.
-matrix: substitution matrix file
-outfile: output file name
... (remaining output omitted)
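-seq, -matrix and -outfile are required. -seq and -outfile are self-explanatory. Unlike the Windows GUI, the command line requires -matrix to be given explicitly; it expects a nucleotide substitution matrix, and the official tutorial recommends matrices/edna.mat. With a conda installation, the matrix can be located with the following commands: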
$ which gepard
~/miniconda3/envs/gepard/bin/gepard
$ cd ~/miniconda3/envs/gepard
$ find -name edna.mat
./share/gepard/resources/matrices/edna.mat
./share/gepard/src/matrices/edna.mat
$ cd ./share/gepard/resources/matrices
$ less edna.mat
#
# This matrix was created by Todd Lowe 12/10/92
#
# Uses ambiguous nucleotide codes, probabilities rounded to
# nearest integer
#
# Lowest score = -4, Highest score = 5
#
# modified for use with gepard (delimiter letter Z)
A T G C N W R Y K M B V H D S U Z X
A 1 0 0 0 -2 -4 1 1 -4 -4 1 -4 -1 -1 -1 -4 -9 -9
T 0 1 0 0 -2 -4 1 -4 1 1 -4 -1 -4 -1 -1 5 -9 -9
G 0 0 1 0 -2 1 -4 1 -4 1 -4 -1 -1 -4 -1 -4 -9 -9
C 0 0 0 1 -2 1 -4 -4 1 -4 1 -1 -1 -1 -4 -4 -9 -9
N -2 -2 -2 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -9 -9
W -4 -4 1 1 -1 -1 -4 -2 -2 -2 -2 -1 -1 -3 -3 -4 -9 -9
R 1 1 -4 -4 -1 -4 -1 -2 -2 -2 -2 -3 -3 -1 -1 1 -9 -9
Y 1 -4 1 -4 -1 -2 -2 -1 -4 -2 -2 -3 -1 -3 -1 -4 -9 -9
K -4 1 -4 1 -1 -2 -2 -4 -1 -2 -2 -1 -3 -1 -3 -1 -9 -9
M -4 1 1 -4 -1 -2 -2 -2 -2 -1 -4 -1 -3 -3 -1 1 -9 -9
B 1 -4 -4 1 -1 -2 -2 -2 -2 -4 -1 -3 -1 -1 -3 -4 -9 -9
V -4 -1 -1 -1 -1 -1 -3 -3 -1 -1 -3 -1 -2 -2 -2 -1 -9 -9
H -1 -4 -1 -1 -1 -1 -3 -1 -3 -3 -1 -2 -1 -2 -2 -4 -9 -9
D -1 -1 -4 -1 -1 -3 -1 -3 -1 -3 -1 -2 -2 -1 -2 -1 -9 -9
S -1 -1 -1 -4 -1 -3 -1 -1 -3 -1 -3 -2 -2 -2 -1 -1 -9 -9
U -4 5 -4 -4 -2 -4 1 -4 1 1 -4 -1 -4 -1 -1 5 -9 -9
Z -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
X -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
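Once the matrix location is known, the location of the program itself is also needed (calling the command from the PATH directly did not work when I tried it, and I do not know why). It is simply the dist directory under .../share/gepard, which contains two files, Gepard-1.40.jar and Gepard-2.1.jar. Use Gepard-2.1.jar (both work, but their commands differ slightly; the latter is used as the example here).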
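The lookup described above can be wrapped in a small helper. This is only a sketch: the function name gepard_paths is hypothetical, and it assumes the conda layout shown in this post (share/gepard/dist and share/gepard/resources under the environment prefix).

```shell
# gepard_paths ENV_PREFIX  (hypothetical helper, not part of Gepard)
# Lists the Gepard jar files and the edna.mat matrix under a conda
# environment prefix, assuming the directory layout shown in this post.
gepard_paths() {
    local share="$1/share/gepard"
    # Jar files live in dist/; sort so the listing order is stable.
    find "$share/dist" -name 'Gepard-*.jar' | sort
    # The matrix recommended by the tutorial lives under resources/matrices/.
    find "$share/resources" -name 'edna.mat'
}
```

For the environment shown above, it could be called as gepard_paths ~/miniconda3/envs/gepard.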
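Now the software can be run. Because the official tutorial has not been kept up to date, the actual command should be as follows (for Gepard-2.1, the version currently downloaded by default) [2]: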
java -cp ~/miniconda3/envs/gepard/share/gepard/dist/Gepard-2.1.jar org.gepard.client.cmdline.CommandLine \
-seq ref.fasta test.fasta \
-matrix ~/miniconda3/envs/gepard/share/gepard/resources/matrices/edna.mat \
-outfile test1.png
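The run fails with an error and a window pops up asking to install something like Xmanager 11. I simply followed the prompts and installed it. The first installation comes with a 30-day trial, which I accepted for now; I will sort that out later if I need it again: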
Loading substitution matrix...
Loading sequence from ref.fasta
Loading sequence from test.fasta
Calculating suffix array...
Calculating dotplot...
Creating image and writing to file...
Exception in thread "main" java.awt.AWTError: Can't connect to X11 window server using 'localhost:12.0' as the value of the DISPLAY variable.
at java.desktop/sun.awt.X11GraphicsEnvironment.initDisplay(Native Method)
at java.desktop/sun.awt.X11GraphicsEnvironment$1.run(X11GraphicsEnvironment.java:104)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.desktop/sun.awt.X11GraphicsEnvironment.<clinit>(X11GraphicsEnvironment.java:63)
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:315)
at java.desktop/java.awt.GraphicsEnvironment$LocalGE.createGE(GraphicsEnvironment.java:101)
at java.desktop/java.awt.GraphicsEnvironment$LocalGE.<clinit>(GraphicsEnvironment.java:83)
at java.desktop/java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(GraphicsEnvironment.java:129)
at java.desktop/java.awt.image.BufferedImage.createGraphics(BufferedImage.java:1181)
at java.desktop/java.awt.image.BufferedImage.getGraphics(BufferedImage.java:1170)
at org.gepard.client.Plotter.<init>(Plotter.java:92)
at org.gepard.client.cmdline.CommandLine.main(CommandLine.java:304)
After installing it, running the command again works normally:
Loading substitution matrix...
Loading sequence from ref.fasta
Loading sequence from test.fasta
Calculating suffix array...
Calculating dotplot...
Creating image and writing to file...
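The stack trace above shows that the failure happens when Java AWT tries to connect to an X11 display while creating the image. As a possible alternative to installing an X server, the JVM has a standard java.awt.headless system property that lets AWT render images without a display. I have not verified this with Gepard, so treat the following as an untested sketch; the helper name run_gepard_headless is hypothetical.

```shell
# run_gepard_headless JAR MATRIX SEQ1 SEQ2 OUT  (hypothetical helper)
# Runs the Gepard 2.1 command-line class with AWT in headless mode.
# Assumption: Gepard's image export works without a display; untested.
run_gepard_headless() {
    local jar="$1" matrix="$2" seq1="$3" seq2="$4" out="$5"
    java -Djava.awt.headless=true \
        -cp "$jar" org.gepard.client.cmdline.CommandLine \
        -seq "$seq1" "$seq2" \
        -matrix "$matrix" \
        -outfile "$out"
}
```

If this works in your environment, no X server or DISPLAY variable should be needed for command-line runs.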
Finally, wrap the command in a loop and batch processing is done!
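A minimal sketch of such a loop is given here. It assumes the conda paths used in this post, that every query file ends in .fasta, and that all queries are compared against one reference; the function name gepard_batch and the GEPARD_HOME variable are hypothetical names introduced for illustration.

```shell
# Paths follow the conda layout shown in this post; override GEPARD_HOME if yours differs.
GEPARD_HOME="${GEPARD_HOME:-$HOME/miniconda3/envs/gepard/share/gepard}"
GEPARD_JAR="$GEPARD_HOME/dist/Gepard-2.1.jar"
GEPARD_MATRIX="$GEPARD_HOME/resources/matrices/edna.mat"

# gepard_batch REF QUERY_DIR OUT_DIR  (hypothetical helper)
# Draws one dot plot per *.fasta file in QUERY_DIR against REF.
gepard_batch() {
    local ref="$1" query_dir="$2" out_dir="$3"
    local query name
    mkdir -p "$out_dir"
    for query in "$query_dir"/*.fasta; do
        [ -e "$query" ] || continue   # skip if the glob matched nothing
        name="$(basename "$query" .fasta)"
        java -cp "$GEPARD_JAR" org.gepard.client.cmdline.CommandLine \
            -seq "$ref" "$query" \
            -matrix "$GEPARD_MATRIX" \
            -outfile "$out_dir/$name.png" \
            || echo "Gepard failed for $query" >&2   # report and continue
    done
}
```

For example, gepard_batch ref.fasta assemblies dotplots would write one PNG per assembly into dotplots, named after the input file.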
References:
[1] Krumsiek J, Arnold R, Rattei T. Gepard: a rapid and sensitive tool for creating dotplots on genome scale[J]. Bioinformatics, 2007, 23(8): 1026-1028.
[2] How to start gepard on the commandline.
[3] Angiosperms · Chloroplast assembly, annotation and comparative analysis · Framework