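Data processing
Converting VCF to the rrBLUP {-1,0,1} format
rrBLUP reads genotypes coded as {-1,0,1}, arranged with one row per sample and one column per marker, so the raw genotype data has to be converted first.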
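Different genotype codings are used when computing the G matrix:
- 0, 1, 2: 0 is AA (homozygous for the major allele), 1 is heterozygous, and 2 is aa (homozygous for the minor allele).
- -1, 0, 1: -1 is AA (homozygous for the major allele), 0 is heterozygous, and 1 is aa (homozygous for the minor allele).
Note that vcftools counts non-reference alleles, so in its output "minor" corresponds to the non-reference allele, which is the minor allele only when the reference allele is the more common one.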
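The two codings differ only by a constant shift of 1, which a few lines of R can illustrate. The matrix `geno012` below is made-up toy data, not taken from the example files.

```r
# Toy genotype matrix (hypothetical data): 3 samples x 4 markers, coded 0/1/2
geno012 <- matrix(c(0, 1, 2, 0,
                    2, 2, 1, 0,
                    1, 0, 0, 2),
                  nrow = 3, byrow = TRUE)

# {0,1,2} -> {-1,0,1}: subtract 1 from every entry
geno_m101 <- geno012 - 1

# {-1,0,1} -> {0,1,2}: add 1 back
geno_back <- geno_m101 + 1

print(geno_m101)
```

Because the shift is the same for every marker, it changes no information in the data; it only moves the heterozygote to 0.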
## vcftools: generating the {0,1,2} matrix
vcftools --vcf test.genotypes_no_missing_IDs.vcf --012 --out snp_matrix
- --012
This option outputs the genotypes as a large matrix. Three files are produced. The first, with suffix ".012", contains the genotypes of each individual on a separate line. Genotypes are represented as 0, 1 and 2, where the number represent that number of non-reference alleles. Missing genotypes are represented by -1. The second file, with suffix ".012.indv" details the individuals included in the main file. The third file, with suffix ".012.pos" details the site locations included in the main file.
##R
data<-as.matrix(read.table("snp_matrix.012",header = F))
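data1<-data[,-c(1)] # drop the first column (the individual index written by vcftools)
data2 <- data1 - 1 # shift 0,1,2 to -1,0,1 (missing values, coded -1, become -2)
write.table(data2, file="SNP_TMP.txt", row.names=FALSE, col.names=FALSE) # save as a plain numeric text file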
##shell
# missing genotypes became -2 after the shift; mark them as NA
sed 's/-2/NA/g' SNP_TMP.txt > snp.txt
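The shell step can be skipped by marking missing genotypes as NA in R before the shift, and the `.012.indv` and `.012.pos` files can supply row and column names at the same time. The sketch below assumes the file layout described in the vcftools text above; to stay self-contained it writes three tiny mock files to a temporary directory, and the sample IDs, positions and the names `prefix` and `geno` are all hypothetical.

```r
# Mock vcftools --012 output in a temporary directory (hypothetical toy data)
prefix <- file.path(tempdir(), "snp_matrix")
writeLines(c("0\t0\t1\t2", "1\t2\t-1\t0"), paste0(prefix, ".012"))
writeLines(c("sampleA", "sampleB"), paste0(prefix, ".012.indv"))
writeLines(c("chr1\t100", "chr1\t250", "chr2\t40"), paste0(prefix, ".012.pos"))

# Read the matrix and drop the first column (individual index)
geno <- as.matrix(read.table(paste0(prefix, ".012"), header = FALSE))[, -1, drop = FALSE]

# Mark missing calls as NA before shifting, so no -2 values are created
geno[geno == -1] <- NA
geno <- geno - 1

# Label rows with sample IDs and columns with chromosome:position
indv <- readLines(paste0(prefix, ".012.indv"))
pos <- read.table(paste0(prefix, ".012.pos"), header = FALSE, stringsAsFactors = FALSE)
dimnames(geno) <- list(indv, paste(pos[[1]], pos[[2]], sep = ":"))

print(geno)
```

`write.table` writes missing values as the string `NA` by default, so a matrix prepared this way can be saved directly without a `sed` pass.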
Input files
Example files:
traits.txt: https://pbgworks.org/sites/pbgworks.org/files/traits.txt
snp.txt: https://pbgworks.org/sites/pbgworks.org/files/snp.txt
library(rrBLUP)
Pheno <- as.matrix(read.table(file ="/data4/ykzhang/chip_207/7GS/rrblup/format/sheep207_mvp.txt", header=TRUE))
Markers <- as.matrix(read.table(file="/data4/ykzhang/chip_207/7GS/rrblup/format/snp.txt", header=FALSE))
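Filtering and imputation
impute = A.mat(Markers,max.missing=0.5,impute.method="mean",return.imputed=T) # drop markers with more than 50% missing values, then fill the remaining gaps with the marker mean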
Markers_impute2 = impute$imputed
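To make the filtering and imputation step concrete, here is a base-R sketch of the idea behind `max.missing=0.5` and `impute.method="mean"`. It is an approximation for illustration only, not the code inside `A.mat`; the helper name `filter_and_impute` and the toy matrix are hypothetical.

```r
# Hypothetical helper: drop markers (columns) whose missing fraction exceeds
# max_missing, then replace the remaining NAs with that marker's mean
filter_and_impute <- function(M, max_missing = 0.5) {
  frac_missing <- colMeans(is.na(M))
  M <- M[, frac_missing <= max_missing, drop = FALSE]
  for (j in seq_len(ncol(M))) {
    miss <- is.na(M[, j])
    M[miss, j] <- mean(M[, j], na.rm = TRUE)
  }
  M
}

# Toy data: 4 samples x 3 markers, coded -1/0/1 with NA for missing
M <- matrix(c( 1, NA, -1,
               0, NA,  1,
              NA, NA,  1,
              -1,  0,  1),
            nrow = 4, byrow = TRUE)

imputed <- filter_and_impute(M)
print(imputed)
```

In this toy matrix the second marker is 75% missing and is dropped, while the single gap in the first marker is filled with that marker's mean. Imputed values are generally not integers, so the imputed matrix no longer contains only -1, 0 and 1.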
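Simple cross-validation
traits=1
cycles=300
accuracy = matrix(nrow=cycles, ncol=traits)
for(r in 1:cycles){
  # randomly pick 180 of the 207 individuals for training; the other 27 form the validation set
  train = sample(1:207, 180)
  test = setdiff(1:207, train)
  Pheno_train=Pheno[train,]
  m_train=Markers_impute2[train,]
  Pheno_valid=Pheno[test,]
  m_valid=Markers_impute2[test,]
  yield=Pheno_train[,7] # the trait is in column 7 of the phenotype file
  # fit RR-BLUP: marker effects are random (u), the intercept is fixed (beta)
  yield_answer<-mixed.solve(yield, Z=m_train, K=NULL, SE = FALSE, return.Hinv=FALSE)
  pred_yield_valid = m_valid %*% as.matrix(yield_answer$u)
  pred_yield=pred_yield_valid[,1]+as.numeric(yield_answer$beta)
  yield_valid = Pheno_valid[,7]
  # accuracy = correlation between predicted and observed phenotypes
  accuracy[r,1] <-cor(pred_yield, yield_valid, use="complete.obs")
}
mean(accuracy)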
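The loop above hard-codes the sample sizes and draws different splits on every run. A small base-R sketch of a reusable, repeatable split is shown below; the helper name `make_split` and the seed value are hypothetical choices, not part of rrBLUP.

```r
# Hypothetical helper: split n individuals into training and validation indices
make_split <- function(n, n_train) {
  train <- sample(seq_len(n), n_train)
  list(train = train, test = setdiff(seq_len(n), train))
}

set.seed(42)  # arbitrary seed, only so that the splits are repeatable
n <- 207      # number of individuals in the example data set
idx <- make_split(n, 180)

length(idx$train)  # 180
length(idx$test)   # 27
```

Inside the loop, `train` and `test` could then come from `make_split(nrow(Markers_impute2), 180)`. Reporting `sd(accuracy)` alongside `mean(accuracy)` also shows how much the estimate varies across the random splits.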
References:
Introduction to Genomic Selection in R using the rrBLUP Package
[GS Column] 8: Genomic Selection in Practice with RRBLUP (【GS专栏】8-全基因组选择实战之RRBLUP)