popABC：群体历史动态模拟

其是由R和C++编写的程序，C++部分进行模拟分析的，有关R的部分则负责作图；也可用别的软件代替R来画图。

其他的模拟方法：
add：可以用它来生成大量模拟数据；R包'abc'：对数据进行取样等各种分析

特点：

popABC能够模拟分枝间的基因流，而DIYabc不能；
popABC能够自动检测所有可能的拓扑结构（不超过5个居群）；DIYabc必须手动设定拓扑结构；
popABC只支持二歧分枝，不支持多居群从同一点炸开；而DIYabc可以；
popABC只能单核，速度较慢（补救方法是多个实例一起跑，然后把结果拼在一起）；DIYabc自动多线程，占用所有CPU资源；

下载地址：https://github.com/jsollari/popABC

安装配置：下载解压到目标目录后，将解压文件的bin目录路径添加到~/.bashrc文件中，在终端运行命令source ~/.bashrc即可

分析流程：

1.准备所需的文件；定义需要的统计量；定义模型和参数；

2.计算实际数据的统计量；

3.生成模拟数据，同时依据前面定义的统计量对每一份模拟数据计算统计量；

4.像其他的统计检验一样，看一下在一堆模拟数据的统计量中，定义的统计量在什么位置，查看各种分布……从而得出所需的结论，及各统计量的分布。

1.1 创建.pop文件

按照popABC给出的示例文件编写pop文件；
注意，要用's'或'm'注名每一个locus是SNP还是microsatelite。

1.2 由.pop文件生成.len文件

.len文件是popABC所使用的特殊格式；终端输入popabc，依次选择‘4’：tools；‘3’：pop文件转为len文件；输入.pop文件的名字；输入要生成的.len文件名；‘9’；‘5’；用文本编辑器打开生成的.len文件，检查文件是否为空，并核对居群数目、位点数目和位点类型

1.3 生成.sst文件（定义所需统计量）

sst文件定义了所需的统计量，分为两块，先是微卫星的统计量，再是SNP的统计量，0表示不计算，1表示计算，如下：
Microsatellites data:
-heterozygosity #1
-variance of alleles length #0
-number of alleles #1
-curtosis of alleles length #0
-Shanon's index 　#1
-Nm estimator based on H 　#1
Sequence data:
-mean of pairwise differences #0
-number of segregating sites #0
-number of haplotypes #0
-Shanon's index #0
-mean of MFS #0
-stdev of MFS #0
-Nm estimator based on S #0
-private segregating sites #0
-S(1) #0

1.4 制作.prs文件（定义模型、参数，和运行参数）

整个文件由分割线分成了三块；第一块是各种数字；第二块是各数字的注释（与第一块对应）；第三块是标示一下有哪些树形结构。example中提供了1pop.prs－5pop.prs等若干示例；直接按照自己的居群数选择对应的示例进行修改即可；最好把第二块复制到一个新的文件里，和原文件的第一块对应着看，然后按自己的需求进行修改。第四行数据是树的拓扑结构，直接改成‘0’，即可检验所有可能的树形结构。

.prs文件示例如下：

100 1 3 3
1 1 0.75
s m s
5 2 0 1 0 10
1 0 5000
1 0 5000
1 0 5000
1 0 5000
1 0 5000
4 20 100 1 1
3 0 10
3 0 10
3 0 10
3 0 10
1 -3 0 0 0
1 -3 0 0 0
0
0
0

PopABC - Mark Beaumont & Joao Lopes 　01/05/09
no_iterations, generation_time, no_populations, no_loci
escalar per locus (autosome - 1; X-linked - 0.75; <wbr> Y-linked or mtDNA - 0.25)
type of DNA data (s - sequence; m - microssatelites)
topology: 0 - uniform distribution;
　　　　　1 - choose topology from a list;
　　　　　2 - specify topology manually [e.g. ((Pop1,Pop2)Pop3) -> 1 2 2 3];
　　　　　3 - uniform distribution (and choose a Model marker);
　　　　　4 - choose topology from a list (and choose a Model marker);
　　　　　5 - specify topology manually (and choose a Model marker).
ne1 params: 1 - uniform distribtuion;
　　　　　　　2 - generalized gamma distribution;
ne2 params
ne3 params
neanc1 params
neanc2 params
t1 params: 1 - uniform distribtuion;
　　　　　　2 - generalized gamma distribution;3 - uniform distribtuion (for all time events);

4 - generalized gamma distribution (for all time events);
　　　　　　[for 1 and 2 t(n) is added to t(n+1)]
　　　　　　[for 3 and 4 set only one priors for all t(n)]

t2
mig1 params: 0 - no migration;
　　　　　　　1 - uniform distribtuion;
　　　　　　　2 - generalized gamma distribution;
　　　　　　　3 - uniform distribtuion (on number of migrations);
　　　　　　　4 - generalized gamma distribution (on number of migrations).
　　　　　　　[for 3 and 4 real mig rate is calculated as nmig/Ne]
mig2 params
mig3 params
miganc1 params
mutM params: 0 - no mutation
　　　　　　　1 - lognormal distribution: (mean of mean(log10); stdev of mean(log10);
mean of Sdev(log10); stdev of stdev(log10). Stdev truncated at 0.
　　　　　　　2 - normal distribution: (mean of mean; stdev of mean; mean of Sdev;
stdev of stdev. Stdev truncated at 0.
mutS params:
recM params: 0 - no recombination;
　　　　　　　1 - lognormal distribution: (mean of mean(log10); stdev of mean(log10);
mean of Sdev(log10); stdev of stdev(log10). Stdev truncated at 0.
　　　　　　　2 - normal distribution: (mean of mean; stdev of mean; mean of Sdev;
stdev of stdev. Stdev truncated at 0.
recS params
migweight: 0 - do not use migweights matrix;
　　　　　　1 - use migweights matrix as following:0 mw112 mw113
　　　　　　0 mw122 mw123
　　　　　　mw211 0 mw213
　　　　　　mw221 0 mw223
　　　　　　mw311 mw312 0
　　　　　　mw321 mw322 0
, where mwitj is the prob that the fraction of migrantes in pop i comes from pop j at a period of time before time event t. Sum of prob should be equal to 1. [only use migweight if the topology is specified (option 1,2,4 or 5)]

Tree topology:

|| Neanc2
|| |
t2|| ------------
|| | |
|| Neanc1 |
|| | |
|| | |
t1|| --------- |
|| | | |
|| | | |
/ Ne Ne Ne

2.生成.trg等一系列文件（计算实际数据的各统计量）

终端运行如下命令：

summdata example.len example.sst example

共生成三个文件：example.trg; example.ssz;example_trg.txt；其中example_trg.txt是工作日志，出错时可以检验其内容；example.ssz文件里是各位点的size；example.trg文件是实际数据的各统计量；可以打开example.trg文件看一下，数目和数值是否有异常。

3.1 生成模拟数值

终端运行如下命令：

simulate example.prs example.ssz example.sst example

最终会生成文件名example.dat文件和同名的example.txt文件;每生成5万个数据，在终端窗口有一次提醒。在生成的过程中，也可notepad打开.dat文件，看看生成到第几个了，从而预估完成时间。其中.txt文件是工作日志，后面会用到。example.dat文件是需要的模拟数据,里面每一行是一次模拟结果。

3.2 运行多个实例

开启多个终端窗口，分别运行上面的命令；每次修改一个不同的输出文件名即可。

注意：不可从同一个文件夹运行多个实例，因为软件会在文件夹下没有后缀名的INTFILE文件里存储随机种子，多个实例会互相干扰；要把整个文件夹复制多份，每一个文件夹里运行一个实例；复制后，每个文件夹下INTFILE里的随机种子是相同的，运行的结果也会是相同的，应在每一个文件夹下，打开终端窗口，然后运行：

shuffle.exe xxxxx #xxxxx是自己随手敲的一个随机数

3.3 连续运行实例

当设置的iteration数目比较大（例如500万）时，软件会提示，计算量太大/文件太大，不能算;但是，这个“太大”是按照十年前计算机的标准……,可以把多条命令行写到同一个.bat文件里，两条命令之间要shuffle一次。准备多个这样的.bat文件（注意每一条的自定义结果名不能重复），同时运行。

3.4 多个实例运行结果的拼接

把所有的.dat文件复制到同一个文件夹下。使用R语言把它们拼接到一起。打开R语言（推荐Rstudio.有时不支持中文路径名），运行如下命令：

setwd(choose.dir()) #设定工作目录：在弹出的窗口里选择放dat文件的文件夹
choose.files(default = getwd())->filestocombine #选择dat文件：在弹出的窗口里选择所有要拼接的dat文件
outfile <- 'combinedData.dat' #设置输出文件名：combinedData.dat，或者别的什么名字。**不要和已有文件重复。**

如果dat文件都不太大,则可运行如下：

countlines <- 0
for(i in 1:length(filestocombine)){
               read.table(file = filestocombine[i])->towrite
               countlines <- countlines + nrow(towrite)
               write.table(towrite,file = outfile, quote = F, row.names = F, col.names = F, sep = ' ', append = T)
cat(i,'in',length(filestocombine),'\n')}
cat('total rows of data:', countlines, '\n')
rm(towrite)

如果dat文件很大（几百个m那种）：

install.packages('data.table') #第一次使用data.table先安装
library('data.table') #安装之后，每次不需要重复上步，直接library载入即可。
countlines <- 0
for(i in 1:length(filestocombine)){
              fread(input = filestocombine[i]) ->towrite
              countlines <- countlines + nrow(towrite)
              fwrite(towrite,file = outfile, quote = F, row.names = F, col.names = F, sep = ' ', append = T)
             cat(i,'in',length(filestocombine),'\n')}
cat('total rows of data:', countlines, '\n')
rm(towrite)

拼接过程中，输出的数字是拼到第几个文件了；全部完成后，会输出总的模拟数目：'total rows of data: xxxxx’。如果之前程序被强行中止的话，.dat文件最后一行会不完整。

===以下的内容都无效了，改用R语言包'abc'对模拟结果进行分析。====

4.1 挑出与实际结果最接近的一批模拟

终端运行：

rejection  拼接的结果文件.dat 之前生成的.trg 自定义的输出文件名 x y rate

其中，x为参数（parameter）数目，y为统计数据（summstats）数目，这两个数据可以在模拟时生成的任意一个日志txt中看到（用notepad打开）；rate是挑选的模拟的比例，例如0.01，就是挑选与实际结果最接近的百分之一；最终生成3个文件：一个.pri文件，一个.rej文件，一个_rej.txt文件；其中，.pri文件里是一万个随机挑选出来的模拟结果（仅包含参数），用来代表所有的模拟结果；.rej文件里是按比例挑出的与实际结果最接近的模拟结果（包含参数和统计数据）；_rej.txt里是工作日志，里面可以看到所有模拟结果中每个统计数据的分布。

4.2 在R语言绘图

程序的scripts文件中提供了绘图的脚本，但是并不那么好用；可以自己修改，以满足自己的需要。

setwd(choose.dir()) #设置工作目录
abc.rej <- data.matrix(read.table(choose.files())) #选择并读取分析所用的.rej文件
priors <- data.matrix(read.table(choose.files())) #选择并读取分析所用的.pri文件

画不同model对应的后验概率柱状图

hist(abc.rej[,1],freq=F,main="prior (blue) and posterior (grey) distributions",xlab="top",col="grey",xlim = c(0,4), breaks = seq(0.5,3.5,1), labels = c(as.character(1:3))) 
#rej中第一列是"top"参数，即不同的模型;每个模型编号对应的树形见随软件的操作手册的最后部分;
#main是图的标题，xlab是x轴的标题，col为颜色，xlim为x轴的范围，breaks是按什么数值截断来画柱状图;
#labels是每个柱子上标什么字符。3个居群时，有3个树形模型。其它情况时xlim/breaks和labels要对应修改。
hist(priors[,1],freq=F,col=rgb(0,0,0.5, a=0.5),add=T,breaks = seq(0.5,3.5,1), density=9)
#在图上添加参数的先验分布（背景）。rgb()中的a表示透明度。density为纹饰。

以上参数画图如下：

popABC.jpeg

可见，不同模型先验分布几乎是相同的（这是我们在prs文件里的设定）。而模型3的后验概率最高，说明3号模型是最佳模型（这时下结论太粗糙！还需要各种后续分析）。由使用手册中可查到，3居群的3号模型为((1,2),0)

挑选出.rej文件中所有3号模型的行；然后可以进一步对model3之下的各参数进行分析。

abc.rej[which(abc.rej[,1]==3),]

进阶内容：

1 实例的强行中止

设置一个较大的iteration让电脑一直运行着，觉得差不多了，就手动中止；在运行过程中，用notepad打开.dat文件，可以看看已经生成了多少个模拟数据；

对终端运行中的simmulate程序，按下快捷键ctrl+c强制中止程序；强制中止后，.dat文件可能最后一行不完整；需要重新用notepad打开.dat文件，删去最后一行；强制中止的话，对应的txt格式的log文件是空的，看不见parameter和summstats的数目；可以先完整跑一个iteration比较少的程序，生成一个txt的log. 只要除了ieration数目之外的参数是一样的，不同的模拟的log文件里各数目就是一样的。

2 dat文件太大，rejection程序无法分析

当dat文件太大时(几个G)，rejection.exe会报错如下：

error:: .dat file is too big to be analysed

这时，应把.dat文件打乱（如果是几个不同文件拼接的话）后拆分成若干份，分别运行rejection命令，然后把得到的.rej文件拼起来。

结果不符合预期时，该怎么办

2.1

summary(allsim[allsim[,1]==14,22:75])->temp
write.table(temp, file = 'model14.txt', quote = F, row.names = F, col.names = F)
n=21
hist(abc.rej[abc.rej[,1]==8,n])
hist(priors[abc.rej[,1]==8,n])

转自：xztswzm的博客　http://blog.sina.com.cn/xzwzm

2020-06-09