0_1 前言
有时间再详细写introduction。先把我分析数据的流程记录下来,后续持续更新。
0_2 用到的软件或脚本
测序数据质控:fastp;
测序数据比对:bowtie2;
比对数据排序等:samtools;
callPeaks:MACS3;
差异peaks寻找及作图:DiffBind;deeptools;
peaks注释:chipseeker;
motif寻找:
0_3 conda环境安装
有些同学使用的是公共服务器,可能没有root权限。
python3.8安装需一堆依赖包,MACS安装一堆依赖包,一不小心就报错, 没权限还装不了~~,酸爽滴狠!
便于安装和管理,所有用到软件统一用conda进行安装。
Anaconda下载地址:Anaconda
我个人推荐下载全功能Anaconda,不要下载miniconda,我在服务器上使用miniconda经常有软件冲突,找不出原因,可能缺少基础包,但最终原因应该是我太菜O_O,但用anaconda没碰到过问题。
#下载并安装conda
wget https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
bash Anaconda3-2021.05-Linux-x86_64.sh
#卸载conda
rm -rf ~/Anaconda
#卸载conda环境设置
rm -rf ~/.condarc ~/.conda ~/.continuum
#构建conda环境(重要)
#要做chipseq分析,我们可以构建一个名字叫chipseq的环境专门用来做分析
#同时指定python版本为3.8.10,R版本为4.1.0
conda create -n chipseq python=3.8.10 rbase=4.1.0
#激活环境(每次重启系统后都要使用该命令)
conda activate chipseq
1.质控及过滤
参考前文测序数据的质控和过滤。
本文中使用fastp一步到位对数进行质控、过滤并统计。
1.1 fastp安装
#方法一(推荐):
# note: the fastp version in bioconda may be not the latest
conda install -c bioconda fastp
#方法二:
# download binary (only for Linux systems) http://opengene.org/fastp/fastp
# this binary was compiled on CentOS, and tested on CentOS/Ubuntu
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp
1.2 使用fastp对rawdata进行质控、过滤及统计
代码如下(chipseq数据不建议进行reads长度筛选,即不需要-l参数):
usage:fastp -i <in1> -o <out1> [-I <in1> -O <out2>] [options...]
#1 PE测序
fastp -i sample.R1.fastq.gz -o sample.R1.clean.fq.gz \
-I sample.R2.fastq.gz -O sample.R2.clean.fq.gz \
-5 --cut_front_window_size 4 --cut_front_mean_quality 20 \
-3 --cut_tail_window_size 4 --cut_tail_mean_quality 20 \
--cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
--detect_adapter_for_pe \
-q 15 -u 40 -e 20 -n 5 -l 30 -p -P 20 -w 4 -R "sample"
#2 SE测序
fastp -i sample.fq -o sample.clean.fq \
-5 --cut_front_window_size 4 --cut_front_mean_quality 20 \
-3 --cut_tail_window_size 4 --cut_tail_mean_quality 20 \
--cut_right --cut_right_window_size 4 \
--cut_right_mean_quality 20 \
-f 5 -q 15 -u 40 -e 20 -n 5 -p -P 20 -w 4 -R "sample"
#parameters
-i, --in1 ;-o, --out1 ; -I, --in2; -O, --out2
--unpaired1 for PE input, if read1 passed QC but read2 not, it will be written to unpaired1.
--unpaired2 for PE input, if read2 passed QC but read1 not, it will be written to unpaired2.
# adapter trimming options
-A, --disable_adapter_trimming
-a, --adapter_sequence
--adapter_sequence_r2
--adapter_fasta
--detect_adapter_for_pe by default, the adapter sequence auto-detection is enabled for SE data only, turn on this option to enable it for PE data.
# global trimming options
-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])
-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])
-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1.
-F, --trim_front2 trimming how many bases in front for read2.
-T, --trim_tail2 trimming how many bases in tail for read2.
-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2.
# per read cutting by quality options
-5, --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
-3, --cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.
-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified (int [=4])
--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])
--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified (int [=4])
--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])
--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified (int [=4])
--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])
# quality filtering options
-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled
-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])
-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement (int [=0])
# length filtering options
-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled
-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])
--length_limit reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])
# low complexity filtering
-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
# filter reads with unwanted indexes (to remove possible contamination)
--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])
--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])
--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])
# base correction by overlap analysis options
-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
--overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default. (int [=30])
--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])
--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])
# UMI processing
-U, --umi enable unique molecular identifier (UMI) preprocessing
--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])
--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])
--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])
--umi_skip if the UMI is in read1/read2, fastp can skip several bases following UMI, default is 0 (int [=0])
# overrepresented sequence analysis
-p, --overrepresentation_analysis enable overrepresented sequence analysis.
-P, --overrepresentation_sampling One in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])
# reporting options
-j, --json the json format report file name (string [=fastp.json])
-h, --html the html format report file name (string [=fastp.html])
-R, --report_title should be quoted with ' or ", default is "fastp report" (string [=fastp report])
# threading options
-w, --thread worker thread number, default is 2 (int [=2])
# output splitting options
-s, --split split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
-S, --split_by_lines split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
-d, --split_prefix_digits the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
# help
-?, --help print this message
1.3 fastp结果解读
#总共生成3个文件(PE测序),包含两个cleanreads文件及一个report文件
sample.R1.clean.fq
sample.R2.clean.fq
sample.report
#report包含reads数和base数以及 Q20、Q30等统计。
2.数据比对
完成数据过滤后或者clean data, 随后我们将clean data比对到参考基因组上用于后续分析。
在此我们用bowtie2进行比对。
2.1 参考基因组下载
参考基因组按需从UCSC下载。
Human:hg19; hg38; Mouse: mm10; mm39
#download hg19
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
#download hg38
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
#download mm10(mm38)
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/mm10.fq.gz
#download mm39
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm39/bigZips/mm39.fq.gz
2.2 bowtie2及samtools安装
#方法(推荐):
conda install -c bioconda bowtie2
conda install -c bioconda samtools
2.3 利用bowtie2构建索引
#以hg19为例
#建立索引目录,并将下载的hg19.fa(参考基因组)转移到该文件夹下或者直接下载到该目录下
mkdir ./hg19
cd hg19
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
bowtie2-build --threads 5 ./hg19.fa ./hg19
#--threads 线程数,根据需求来, 5核大约需要30min
#./hg19.fa 指定参考基因组
#./hg19 索引的存储位置及名字,即当前目录下,以hg19开头
#索引构建完成后,生成6个文件
# hg19.1.bt2
# hg19.2.bt2
# hg19.3.bt2
# hg19.4.bt2
# hg19.rev.1.bt2
# hhg19.rev.2.bt2
2.4 用bowtie2将质控过滤后的cleanreads比对到参考基因组
#usage: bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} -S [<hit>]
#PE测序
bowtie2 -p 6 -q \
-x /path/to/index/mm10/mm10 \
-1 sample.R1.clean.fq.gz \
-2 sample.R2.clean.fq.gz \
-S sample.sam
--local
#SE测序
bowtie2 -p 6 -q \
-x /path/to/index/mm10/mm10 \
-U sample.R1.clean.fq.gz \
-S sample.sam
--local
# -p/--threads NTHREADS 设置线程数. Default: 1
# -q reads 是 fastq 格式的
# -x <bt2-idx> index 路径
# -1 <m1> 双末端测序的 _1.fastq 路径。可以为多个文件,并用逗号分开;多个文件必须和 -2 <m2> 中制定的文件一一对应。
# -2 <m2> 双末端测序的 _2.fastq 路径.
# -U <r> 非双末端测序的 fastq 路径。可以为多个文件,并用逗号分开。
# -S <hit> 输出 Sam 格式文件。
# -3/--trim3 <int> 剪掉3'端<int>长度的碱基,再用于比对。(default: 0).
# 用fastqc看了看数据质量,发现3端质量有点问题,我就用了-3 5 --local参数,
# --local 如果fq文件是没有经过 trim 的,可以用局部比对执行 soft-clipping,加上参数--local 。该模式下对read进行局部比对, 从而, read 两端的一些碱基不比对,从而使比对得分满足要求.
2.5 比对完成后要进行排序sort并转换为bam文件
#对sam文件排序并输出为bam文件
samtools sort sample.sam -o sample.sorted.bam
2.6 统计插入片段长度fragment size并绘图
2.6.1 利用samtools直接提取信息
对于PE测序而言,bam文件的第9列直接存储了fragment length的信息,直接提取之后就可以用来分析画图,提取的代码如下:参考代码.
#提取fragement length 信息
samtools view sample.bam | \
awk -F '\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print $1"\t"abs($9)}' | \
sort | uniq | cut -f2 > sample.fragment.length.txt
#bam文件中每一行以reads为单位,这里去重(uniq)是为了避免来自同一个fragemnts的reads重复统计。
#提取好之后,用R画图就可以了。
#将下列代码写入sample.r文件中,然后用运行Rscript sample.r
#读取数据
data <- read.table("fragment.length.txt", header = F)
# 设置插入片段长度的阈值,过滤掉太长的片段,去掉长度为0的片段
length_cutoff <- 1200
fragment <- data$V1[data$V1 <= length_cutoff]
# 利用直方图统计频数分布,设置柱子个数
breaks_num <- 500
res <- hist(fragment, breaks = breaks_num, plot = FALSE)
# 添加坐标原点
plot(x = c(0, res$breaks),
y = c(0, 0, res$counts) / 10^2,
type = "l", col = "red",
xlab = "Fragment length(bp)",
ylab = expression(Normalized ~ read ~ density ~ 10^2))
###批量运算脚本(通过传递参数的方式传递样本名称)
#传递样本名称
args <- commandArgs(T)
sample1 <- args[1]
#文件名称
sample2 <- paste(sample1,".fragment.length.txt",sep="",collapse=NULL)
#图写入pdf文件,文件名为样本名称
pdf(file=paste(sample1,".pdf",sep="",collapse=NULL))
#读取数据
data <- read.table(sample2, header = F)
# 设置插入片段长度的阈值,过滤掉太长的片段,去掉长度为0的片段
length_cutoff <- 1200
fragment <- data$V1[data$V1 <= length_cutoff && data$V1 >0]
# 利用直方图统计频数分布,设置柱子个数
breaks_num <- 500
res <- hist(fragment, breaks = breaks_num, plot = FALSE)
# 添加坐标原点
plot(x = c(0, res$breaks),
y = c(0, 0, res$counts) / 10^2,
type = "l", col = "red",
xlab = "Fragment length(bp)",
ylab = expression(Normalized ~ read ~ density ~ 10^2))
#关闭图形设备
dev.off()
#运行脚本
Rscript sample.r sample
图如下:
2.6.2 利用deeptoos的bamPEFragmentSize统计片段长度
除此之外,还有picard的CollectInsertSizeMetrics,deeptools的bamPEFragmentSize也都可以计算插入片段长度。
### deeptoos统计bamPEFragmentSize
usage: bamPEFragmentSize [-h] [--bamfiles bam files [bam files ...]]
[--histogram FILE] [--plotFileFormat FILETYPE]
[--numberOfProcessors INT]
[--samplesLabel SAMPLESLABEL [SAMPLESLABEL ...]]
[--plotTitle PLOTTITLE]
[--maxFragmentLength MAXFRAGMENTLENGTH] [--logScale]
[--binSize INT] [--distanceBetweenBins INT]
[--blackListFileName BED file] [--table FILE]
[--outRawFragmentLengths FILE] [--verbose]
[--version]
###示例
bamPEFragmentSize -hist fragmentSize.png -T "Fragment size of PE-seq data" --maxFragmentLength 1000 \
-b testFiles/sample.bam testFiles/RNAseq_sample2.bam testFiles/RNAseq_sample3.bam testFiles/RNAseq_sample4.bam \
--samplesLabel sample1 sample2 sample3 sample4
###parameters
## --bamfiles, -b List of BAM files to process
## --histogram, -hist, -o
# Save a .png file with a histogram of the fragment length distribution.
## --plotFileFormat
# png, pdf, svg, eps, plotly
## --numberOfProcessors, -p
# Number of processors to use. The default is to use 1. (Default: 1)
## --samplesLabel
# Labels for the samples plotted. The default is to use the file name of the sample.
# The sample labels should be separated by spaces and quoted if a label itselfcontains a space E.g. –samplesLabel label-1 “label 2”
## --plotTitle, -T
# Title of the plot, to be printed on top of the generated image. Leave blank for no title. (Default: )
## --maxFragmentLength
# The maximum fragment length in the histogram. A value of 0 (the default) indicates to use twice the mean fragment length. (Default: 0)
## --logScale Plot on the log scale
## --binSize, -bs Length in bases of the window used to sample the genome. (Default: 1000)
## --distanceBetweenBins, -n
# To reduce the computation time, not every possible genomic bin is sampled.
# This option allows you to set the distance between bins actually sampled from.
# Larger numbers are sufficient for high coverage samples, while smaller values are useful for lower coverage samples.
# Note that if you specify a value that results in too few (<1000) reads sampled, the value will be decreased. (Default: 1000000)
## --blackListFileName, -bl
# A BED file containing regions that should be excluded from all analyses.
##--table
# In addition to printing read and fragment length metrics to the screen, write them to the given file in tabular format.
## --outRawFragmentLengths
# Save the fragment (or read if the input is single-end) length and their associated number of occurrences to a tab-separated file.
# Columns are length, number of occurrences, and the sample label.
3 calling peaks
测序数据比对完成后,要进行peaks分析。在这里推荐MACS3。
3.1 软件介绍
软件地址:https://github.com/macs3-project/MACS
安装介绍:https://github.com/macs3-project/MACS/blob/master/docs/INSTALL.md
软件依赖: Python3.6, 3.7 或3.8; Numpy (>=1.17); Cython>=0.29;cykhash;fermi-lite;simde。
3.2软件安装
#conda安装(截止本文,还只能下载macs2)
conda install -c bioconda macs2
#pip安装macs3
pip install macs3 #安装macs3,依赖包会自动安装,如上述numpy等
pip install --upgrade macs3 #用于旧版本macs3升级
macs3 callpeak参数见:https://github.com/macs3-project/MACS/blob/master/docs/callpeak.md
3.3 软件应用参考案例
#Examples:
#1. Peak calling for regular TF ChIP-seq:
$ macs3 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01
#2. Broad peak calling on Histone Mark ChIP-seq:
$ macs3 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1
#3. Peak calling on ATAC-seq (paired-end mode):
$ macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01
#4. Peak calling on ATAC-seq ( focusing on insertion sites, and using single-end mode):
$ macs3 callpeak -f BAM -t ATAC.bam -g hs -n test -B -q 0.01 --shift -50 --extension 100
3.4 软件应用实例
peaks calling:寻找可能的结合位点,即基因组中大量reads富集的区域。
详细参数见:https://github.com/macs3-project/MACS/blob/master/docs/callpeak.md
注意:
如果找重复间共有peaks,建议样本重复一起输入进行call peaks;
如果后续还要做重复间差异等等,建议单个单个重复进行比对
###重复一起输入找共有peaks
macs3 callpeak -t sample1.bam sample2.bam -c input1.bam input2.bam -f BAMPE -n sample --outdir ./ -g 1.87e9 -B
###重复分别进行比对
macs3 callpeak -t sample1.bam -c input1.bam -f BAMPE -n sample1 --outdir ./ -g 1.87e9 -B
macs3 callpeak -t sample2.bam -c input2.bam -f BAMPE -n sample2 --outdir ./ -g 1.87e9 -B
# -t treatment FILENAME,若多个重复,可以-t A B C
# -c control FILENAME,若多个重复,可以-c A B C
# -f 指定输入文件,我这是BAMPE格式,注意必须是sort后的bam.
# 可以是”ELAND”, “BED”, “ELANDMULTI”, “ELANDEXPORT”, “ELANDMULTIPET” (for pair-end tags),
# “SAM”, “BAM”, “BOWTIE”, “BAMPE” “BEDPE” 任意一个。如果不提供这项,就是自动检测选择。
# -n 生成的文件前缀名
# --outdir 结果文件存放位置,我这里是当前目录
# -g 参考基因组大小hs: 2.7e9; mm: 1.87e9; ce: 9e7; dm: 1.2e8, 不在其中的话,比如说拟南芥,就需要自己提供了。
# -B 会保存更多的信息在bedGraph文件中,如fragment pileup, control lambda, -log10pvalue and -log10qvalue scores
# -q: q值,也就是最小的PDR阈值, 默认是0.05。q值是根据p值利用BH计算,也就是多重试验矫正后的结果。
# -p: 这个是p值,指定p值后MACS2就不会用q值了。
# -m: 和MFOLD有关,而MFOLD和MACS预构建模型有关,默认是5:50,MACS会先寻找100多个peak区构建模型,一般不用改,因为你很大概率上不会懂。
3.5 结果文件
NAME_peaks.xls
是tab键分隔的peaks文件;
NAME_peaks.narrowPeak
是BED4文件,可直接导入UCSC genome browser查看;
NAME_summits.bed
是峰值文件,可用于后续寻找motif;
NAME_model.r
是r脚本,用于PE数据call peaks模型作图;
NAME_treat_pileup.bdg
和 NAME_control_lambda.bdg
是bedGraph 格式文件,可以导入UCSC genome browser 或者转换为bigWig文件进行可视化。两个文件分别对应处理组和对照组。
当设置-broad参数时:
NAME_peaks.broadPeak
同 NAME_peaks.narrowPeak
,除第10列不同
NAME_peaks.gappedPeak
同时包含narrow和broad peaks
详细解释如下
#结果文件,macs分析完成后,每个样品获得7个文件
brain_ac_peaks.xls
brain_ac_peaks.narrowPeak
brain_ac_summits.bed
brain_ac_model.r
brain_ac_treat_pileup.bdg
brain_ac_control_lambda.bdg
1. `NAME_peaks.xls` is a tabular file which contains information about called peaks.
Information include:
* chromosome name
* start position of peak
* end position of peak
* length of peak region
* absolute peak summit position
* pileup height at peak summit
* -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then this value should be 10)
* fold enrichment for this peak summit against random Poisson distribution with local lambda,
* -log10(qvalue) at peak summit
Coordinates in XLS is 1-based which is different from BED format.
When `--broad` is enabled for broad peak calling, the pileup, p-value, q-value,
and fold change in the XLS file will be the mean value across the entire peak region,
since peak summit wonot be called in broad peak calling mode.
2. `NAME_peaks.narrowPeak` is BED6+4 format file which contains the peak locations together with peak summit,
p-value, and q-value. You can load it to the UCSC genome browser. Definition of some specific columns are:
* 5th: integer score for display. It is calculated as `int(-10*log10pvalue)` or `int(-10*log10qvalue)`
depending on whether `-p` (pvalue) or `-q` (qvalue) is used as score cutoff.
Please note that currently this value might be out of the [0-1000] range defined in [UCSC ENCODE narrowPeak format]
(https://genome.ucsc.edu/FAQ/FAQformat.html#format12).
You can let the value saturated at 1000 (i.e. p/q-value = 10^-100) by using the following 1-liner
awk: `awk -v OFS="\t" '{$5=$5>1000?1000:$5} {print}' NAME_peaks.narrowPeak`
* 7th: fold-change at peak summit
* 8th: -log10pvalue at peak summit
* 9th: -log10qvalue at peak summit
* 10th: relative summit position to peak start
The file can be loaded directly to the UCSC genome browser.
Remove the beginning track line if you want to analyze it by other tools.
3. `NAME_summits.bed` is in BED format, which contains the peak summits locations for every peak.
The 5th column in this file is the same as what is in the `narrowPeak` file.
If you want to find the motifs at the binding sites, this file is recommended.
The file can be loaded directly to the UCSC genome browser.
Remove the beginning track line if you want to analyze it by other tools.
4. `NAME_peaks.broadPeak` is in BED6+3 format which is similar to `narrowPeak` file,
except for missing the 10th column for annotating peak summits.
This file and the `gappedPeak` file will only be available when `--broad` is enabled.
Since in the broad peak calling mode, the peak summit wonot be called,
the values in the 5th, and 7-9th columns are the mean value across all positions in the peak region.
Refer to `narrowPeak` if you want to fix the value issue in the 5th column.
5. `NAME_peaks.gappedPeak` is in BED12+3 format which contains both the broad region and narrow peaks.
The 5th column is the score for showing grey levels on the UCSC browser as in `narrowPeak`.
The 7th is the start of the first narrow peak in the region, and the 8th column is the end.
The 9th column should be RGB color key, however, we keep 0 here to use the default color, so change it if you want.
The 10th column tells how many blocks including the starting 1bp and ending 1bp of broad regions.
The 11th column shows the length of each block and
The 12th for the start of each block.
13th: fold-change,
14th: *-log10pvalue*,
15th: *-log10qvalue*.
The file can be loaded directly to the UCSC genome browser.
Refer to `narrowPeak` if you want to fix the value issue in the 5th column.
6. `NAME_model.r` is an R script which you can use to produce a PDF image of the model based on your data. Load it to R by:
`$ Rscript NAME_model.r`
Then a pdf file `NAME_model.pdf` will be generated in your current directory. Note, R is required to draw this figure.
7. The `NAME_treat_pileup.bdg` and `NAME_control_lambda.bdg` files are in bedGraph format which can be imported to
the UCSC genome browser or be converted into even smaller bigWig files.
The `NAME_treat_pielup.bdg` contains the pileup signals (normalized according to `--scale-to` option) from ChIP/treatment sample.
The `NAME_control_lambda.bdg` contains local biases estimated for each genomic location from the control sample,
or from treatment sample when the control sample is absent.
The subcommand `bdgcmp` can be used to compare these two files and make a bedGraph file of scores such as p-value,
q-value, log-likelihood, and log fold changes.
3.6 macs3寻找peaks的model
对于PE测序来说,测序reads是文库插入片段的两端,而实际上的peaks 峰应该是靠近中间位点,或者覆盖两个峰的中间区域,因此在后续做peaks分析时要构建峰值模型。个人是这样理解,后续我碰到详细解释的时候就贴上来。
###获取peaks model
Rscript brain_ac_model.r
4 DiffBind找差异peaks
DiffBind是鉴定两个样本间差异结合位点的一个R包。主要用于peak数据集,包括对peaks的重叠和合并的处理,计算peaks重复间隔的测序reads数,并基于结合亲和力鉴定具有统计显著性的差异结合位点。适用的统计模型有DESeq、DESeq2、edgeR。详细内容可参考DiffBind详细说明文档。
4.1 DiffBind安装
#方法一,在R中安装
library(BiocManager)
BiocManager::install("DiffBind")
#方法二,通过conda安装
conda install -c bioconda bioconductor-diffbind
4.2 样品表准备
共有11列:
1th SampleID 样品名称
2th Tissue 组织
3th Factor 因子
4th Condition 时空等条件
5th Treatment 分组
6th Replicate 实验重复
7th PeakCaller 寻找peaks的方法,支持多种方法,本文中用的是macs
8th bamPeaks 样品Peaks对应的比对后的bam文件,最好是绝对路径
9th ControlID 一般是input的名称
10th bamControl 对照组input对应的比对后的bam文件,最好是绝对路径
11th Peaks 通过MACS3寻找到的peaks,可以是 NAME_peaks.xls
文件,也可以是 NAME_peaks.narrowPeak
文件。具体表格如下(注意,样品表最好上传.txt格式的文件或者.csv格式的文件):
4.3 具体分析流程代码
建议根据实际情况选择或调整参数,本代码适用于本人分析流程,每一步骤我都详细进行了注释。
在windows下,可以直接一行行复制代码调试;
在linux下,将下列代码复制到名为DiffbindX.r(以.r结尾就行,其中X代表数字1,2,3等)的文件中,然后通过Rscript Diffbind.r 运行。
4.3.1获得用于后续分析的数据集
因为DiffBind前边分析中,有些步骤需要大量的运行时间和运行内存,如读取bam文件用于reads count和数据标准化过程。我们在调试代码或分析数据过程中不可能每次因为一点小的改变都要去重复运行,对于我这种租用学校服务器的穷人来说,不经济,十分不经济;对于时间宝贵的各位大佬来说,费时间,非常费时间。因此对于一些通用数据处理步骤,先写一个脚本全部完成,然后将需要的data保存为数据集,后续直接加载所需要的数据集就可以了。我们根据这个思路先将下列脚本保存为DiffBind1.r,用于数据集获得。
本研究中,因为我chip-seq做的全是h3k27me3,所以我读取数据时全用h3k27保存,大家可以根据自己的实验或者爱好调整。
setwd(./)
library(DiffBind)
###读取 peaksets中samples infromation,注意数据表路径
samples <- read.table("./samplesheet.txt",head=T)
samples
###转换为 dba object
h3k27 <- dba(sampleSheet=samples)
h3k27
###推荐将这个结果保存为data储存起来用于后续分析,后续可以直接读取数据,不需要上述步骤了
###存储的是原始peaks数据,没有通过reads数进行校正
save(h3k27,file="h3k27original.data")
###根据原始peaks数据(peak caller score)绘制样本间相关性热图,并存储为PDF
pdf("Figure 1 Correlation heatmap, using occupancy (peak caller score) data.pdf")
plot(h3k27)
dev.off()
###根据bam文件中reads数进行校正,这时候要读取bam文件,花费较多时间和内存。样本数越多,所需时间和内存就越多
###calculate a binding matrix with scores based on read counts for every sample (affinity scores)
h3k27 <- dba.count(h3k27)
h3k27
###推荐将这个结果保存为data储存起来用于后续分析,后续可以直接读取数据,不需要上述读取bam文件步骤,节约大量时间,避免重复运行
save(h3k27,file="h3k27count.data")
###根据原校正后peaks数据(affinity scores)绘制样本间相关性热图,并存储为PDF
pdf("Figure 2 Correlation heatmap, using affinity (read count) data.pdf")
plot(h3k27)
dev.off()
###统计peaks reads信息,便于后续标准化。
###每个样品peaks对应的总reads数比上每个样品library总reads数即为FRiP。
###Counting reads,library sizes and Fraction of Reads in Peaks(FRiP).
info <- dba.show(h3k27)
libsizes <- cbind(LibReads=info$Reads, FRiP=info$FRiP,PeakReads=round(info$Reads*info$FRiP))
rownames(libsizes) <- info$ID
libsizes
###libsizes包含4列信息如下
# > libsizes
# LibReads FRiP PeakReads
#BT4741 652697 0.16 104432
#BT4742 663370 0.15 99506
#MCF71 346429 0.31 107393
#MCF72 368052 0.19 69930
#MCF73 466273 0.25 116568
###推荐将这个结果保存为data储存起来用于后续分析,后续可以直接读取数据
save(libsizes,file="libsizes.data")
###对数据根据reads数进行标准化,然后储存为标准化后的对象h3k27
###Normalizing the data,library sizes
###only takes the library size (total number of aligned reads, or sequencing depth) into account
h3k27 <- dba.normalize(h3k27,normalize=DBA_NORM_LIB)
norm <- dba.normalize(h3k27, bRetrieve=TRUE)
norm
###norm包含了标准化的各种参数,如方法,每个样品的标准化因子,标准化对应的库以及每个样本库的大小,如下:
#$norm.method
#[1] "lib"
#$norm.factors
#[1] 0.8099610 0.8232056 0.4298993 0.4567323 0.5786191 0.4962278 1.8309087
#[8] 0.7652039 0.7361583 0.8771445 3.1959394
#$lib.method
#[1] "full"
#$lib.sizes
#[1] 652697 663370 346429 368052 466273 399879 1475415 616630 593224
#[10] 706836 2575408
#$filter.value
#[1] 1
###标准化后的库文件,
###The details of the normalized libsizes
normlibs <- cbind(FullLibSize=norm$lib.sizes,
NormFacs=norm$norm.factors,
NormLibSize=round(norm$lib.sizes/norm$norm.factors))
rownames(normlibs) <- info$ID
normlibs
###标准化后的库文件信息,可以看到经标准化后,所有库背景reads都变为一样了,即805838
#> normlibs
#FullLibSize NormFacs NormLibSize
#BT4741 652697 0.8099610 805838
#BT4742 663370 0.8232056 805838
#MCF71 346429 0.4298993 805838
#MCF72 368052 0.4567323 805838
#MCF73 466273 0.5786191 805838
###推荐将这些结果保存为data储存起来用于后续分析,后续可以直接读取数据,节约大量时间
save(norm,file="norm.data")
save(normlibs,file="normlibs.data")
save(h3k27,file="h3k27normlized.data")
4.3.2 加载保存的数据集用于后续计算
在4.3.1中,我们保存了大量数据集,在这一部分,我们加载这些数据集用于做后续的数据分析。
这部分代码保存在DiffBind2.r中。
setwd(./)
library(DiffBind)
###按照顺序或者需求加载数据集,顺序错乱会导致后边用的数据不正确
###如做Figure 1和2时首先加载第一项,用DiffBind1.r中的代码做Figure 1;然后再加载第二项,用DiffBind1.r中的代码做Figure 2
###如果做后续差异分析,建议按照顺序加载一遍。
load("h3k27original.data")
load("h3k27count.data")
load("libsizes.data")
load("norm.data")
load("normlibs.data")
load("h3k27normlized.data")
###先看一下重复间共有和特有peaks,我的是根据组来看的,我的数据有三个treatment,
###分别是CK,negative和positive;每个组有两种条件,每种条件有两个重复。
###这样的话,每种treatment就有4个样品,这4个样品做venn图就可以看共有和特有peaks。
###好像dba.plotVenn最多也就做4个样品的venn图了,再多就报错了,不知道后续会不会升级
###masks是DBA object包含的信息,创建DBA object时,会自动产生masks,每个条目是一个masks
###如H3k27的masks就有各个treatment、conditions等等,可以用names(h3k27$masks)查看
###在windows中,用h3k27$masks$按tab健可以看有哪些masks
###peaks among replicates or different conditions,figure 3
pdf("Figure 3 peaks among replicats.pdf")
dba.plotVenn(h3k27,h3k27$masks$CK)
dba.plotVenn(h3k27,h3k27$masks$Neg)
dba.plotVenn(h3k27,h3k27$masks$Pos)
dev.off()
###接下来是最重要的步骤
###构建contrast,即差异分析分组,这直接关系到后边所有作图数据,所以这一步极其关键
###Establishing a model design and contrast, used for differential analysis
###基本语法:
# dba.contrast(DBA, group1, group2=!group1, name1="group1", name2="group2",
# minMembers=3, block, bNot=FALSE,
# categories=c(DBA_TISSUE,DBA_FACTOR,DBA_CONDITION,DBA_TREATMENT))
#-------------------------------构建contrast,第一种情况,整体分析---------------------------------------------#
###本研究中我们按照数据表中Treatment构建contrast,因为只有两个重复,所以设置最少样本为2,无重复不能做
h3k27 <- dba.contrast(h3k27,
minMembers=2,
categories = DBA_TREATMENT)
h3k27
###h3k27有三个Treatment,所以有三种contrast,对应着不同差异分析分组
# 1. CK vs Neg
# 2. CK vs Pos
# 3. Neg vs Pos
###这样我们就可以根据不同contrast编号进行后续作图了,如不同分组的差异peaks 火山图、venn图和热图等等。
###但是有个bug,后续做Profie时默认只做第一个contrast的,非常非常的不合理,盼望后续R包升级改进,按照contrastID进行作图
#--------------------------------构建contrast,第二种情况,分组分析---------------------------------------------#
###鉴于以上bug,当分组多,我们在做contrast时,分别进行处理,一组做一个contrast,即做一组差异
###当做一个新的contrast时,需把旧的contrast清空,运行如下命令
h3k27$contrasts=NULL
h3k27 <- dba.contrast(h3k27,
group1=h3k27$masks$CK,
group2=h3k27$masks$Pos,
name1="CK",
name2="Pos",
minMembers=2)
##语法一:以上为CK和Pos组对比,其它组对比可以参考这个
h3k27 <- dba.contrast(h3k27,
categories=DBA_CONDITION,
block=h3k27$masks$CK,
#block=list(c(3,4,5,8,9),c(1,2,10,11)
minMembers=2)
##语法二:以上为按照数据表中Condition策略进行分组时,仅分析Treatment为CK的部分(#或者仅分析数据表中指定数据行)
#------------------------------------------构建contrast,over----------------------------------------------------#
###构建contrast完成后,根据contrast进行差异分析。
###切记,contrast很重要,每次转换contrast时都要清空旧的contrast
### 差异分析步骤Performing the differential analysis based on contrast
###可以分别选择三种方法,也可以同时选择,方法有edgeR,DESeq2等
h3k27 <- dba.analyze(h3k27,method=DBA_ALL_METHODS)
###显示比对信息,包括分组、样品数目、分析方法、差异peaks数量等
###如果设置方法为DBA_ALL_METHODS,会有两组数据,分别为DESeq2和edgeR的
dba.show(h3k27, bContrasts=TRUE)
###单contrast、单方法分析(举例而已,样品不是本研究样品)
# Factor Group Samples Group2 Samples2 DB.DESeq2
#1 Condition Resistant 4 Responsive 7 246
###多contrast、多方法分析(举例而已,样品不是本研究样品)
# Factor Group Samples Group2 Samples2 DB.edgeR DB.DESeq2
#1 Tissue Brain 3 CN 3 22495 16041
#2 Tissue Brain 3 NPC 3 30228 24881
#3 Tissue Brain 3 ESC 3 93654 91723
#4 Tissue CN 3 NPC 3 34165 27994
#5 Tissue CN 3 ESC 3 90362 87404
#6 Tissue ESC 3 NPC 3 84192 80557
###contrast数据差异分析完成后,我们就可以输出差异数据及愉快的作图了
###输出差异数据output the Differentially Binding data
##多contrast、多方法分析,可以根据contrast的ID及方法输出数据,如下
comp1.edgR<- dba.report(h3k27, method=DBA_EDGER, contrast = 1 )
comp1.deseq2<- dba.report(h3k27, method=DBA_DESEQ2, contrast = 1)
head(comp1.deseq2)
# 输出EdgeR分析的差异数据,以congtrast (Neg-Pos组)为例
out <- as.data.frame(comp1.edgR)
write.table(out, file="./Neg_vs_Pos_edgeR.xls", sep="\t", quote=F, col.names = NA)
#输出 DESeq2分析的差异数据,以congtrast (Neg-Pos组)为例
out <- as.data.frame(comp1.deseq2)
write.table(out, file="./Neg_vs_Pos_deseq2.xls", sep="\t", quote=F, col.names = NA)
###差异数据绘图,draw pictures for Differentially Binding data
###plotting Figure 4根据差异数据做相关性热图,可整体绘图,也可根据contrast绘图。
pdf("Figure 4 Correlation heatmap, using differential binding data.pdf")
plot(h3k27)
plot(h3k27, contrast=1)
plot(h3k27, contrast=2)
plot(h3k27, contrast=3)
dev.off()
###plotting Figure 5,根据差异数据做venn图。
pdf("Figure 5 plotVenn for different binding statis.pdf")
dba.plotVenn(h3k27, contrast=1, bDB=TRUE, bGain=TRUE, bLoss=TRUE, bAll=FALSE,main="D0 vs D5N")
dba.plotVenn(h3k27, contrast=2, bDB=TRUE, bGain=TRUE, bLoss=TRUE, bAll=FALSE,main="D0 vs D5P")
dba.plotVenn(h3k27, contrast=3, bDB=TRUE, bGain=TRUE, bLoss=TRUE, bAll=FALSE,main="D5N vs D5P")
dev.off()
###plotting Figure 6,根据差异数据做PCA图。
pdf("Figure 6 plotPCA for different binding statis.pdf")
dba.plotPCA(h3k27, contrast=1, label=DBA_TREATMENT)
dba.plotPCA(h3k27, contrast=2, label=DBA_TREATMENT)
dba.plotPCA(h3k27, contrast=3, label=DBA_TREATMENT)
dev.off()
###plotting Figure 7,根据差异数据做MA图。
### MA plots are a useful way to visualize the relationship between the overall binding level at
### each site and the magnitude of the change in binding enrichment between conditions, as well
### as the effect of normalization on data.
pdf("Figure 7 plotMA for different binding statis.pdf")
dba.plotMA(h3k27,contrast=1)
dba.plotMA(h3k27,contrast=2)
dba.plotMA(h3k27,contrast=3)
dev.off()
###plotting Figure 8,根据差异数据做火山图。
pdf("Figure 8 plotvolvano for different binding statis.pdf")
dba.plotVolcano(h3k27,contrast=1)
dba.plotVolcano(h3k27,contrast=2)
dba.plotVolcano(h3k27,contrast=3)
dev.off()
###plotting Figure 9,根据差异数据做箱线图。
pdf("Figure 9 plotbox for different binding statis.pdf")
dba.plotBox(h3k27,contrast=1)
dba.plotBox(h3k27,contrast=2)
dba.plotBox(h3k27,contrast=3)
dev.off()
###plotting Figure 10,根据差异数据做热图。
pdf("Figure 10 heatmap for different binding statis.pdf")
hmap <- colorRampPalette(c("red", "black", "green"))(n = 13)
dba.plotHeatmap(h3k27, scale="row", colScheme = hmap)
dba.plotHeatmap(h3k27, contrast=1, correlations=FALSE,scale="row", colScheme = hmap)
dba.plotHeatmap(h3k27, contrast=2, correlations=FALSE,scale="row", colScheme = hmap)
dba.plotHeatmap(h3k27, contrast=3, correlations=FALSE,scale="row", colScheme = hmap)
dev.off()
###plotting Figure 11,根据差异数据做plotProfile。
###注意,如果有多个contrast,默认只做第一个contrast的图;
###所以,如果想做不同的contrast,建议如前文提到,重新做目标contrast,然后dba.analyze,最后做profile图。
pdf("Figure 11 CK_Neg_plotProfile for different binding data.pdf")
profiles <- dba.plotProfile(h3k27, doPlot=TRUE)
dba.plotProfile(profiles)
dev.off()
###plotting Figure 12,根据另一个contrast差异数据做plotProfile。
h3k27$contrasts=NULL
h3k27 <- dba.contrast(h3k27,
group1=h3k27$masks$CK,
group2=h3k27$masks$Neg,
name1="CK",
name2="Neg",
minMembers=2)
h3k27 <- dba.analyze(h3k27,method=DBA_ALL_METHODS)
pdf("Figure 12 CK_Neg_plotProfile for different binding data.pdf")
profiles <- dba.plotProfile(h3k27, doPlot=TRUE)
dba.plotProfile(profiles)
dev.off()
###不同样品中共有peaks分析,即overlap分析。
###perform overlap analysis
olap.rate <- dba.overlap(h3k27,mode=DBA_OLAP_RATE)
olap.rate
#结果显示如下,共11个样品,下列数据分别代表共同在1-11个样品中出现的peaks数量
#[1] 3795 2845 1773 1388 1074 817 653 484 384 202 129
###plotting Figure 13,overlap peaks in all samples
pdf("Figure 13 overlap ioverlap peaks in all samples.pdf")
plot(olap.rate,type='b',ylab='# peaks', xlab='Overlap at least this many peaksets')
dev.off()
关于DiffBind,我还没有完全吃透,如很多种标准化方法、多因素分析等。后续用的上的时候再去琢磨,希望各位大佬能多提一些意见。
5 寻找重复间共有peaks (引自微信公众号《生信技能树》 )
ATAC-Seq要求必须有2次或更多次生物学重复(十分珍贵或者稀有样本除外,但必须做至少2次技术重复)。理论上重复样本的peaks应该有高度的一致性,实际情况并不完全与预期一致。如何评价重复样本的重复性的好坏?如何得到一致性的peaks?这里参考微信公众号《生信技能树》的文章重复样本的处理。
5.1 利用bedtools寻找重复间共有peaks
如何得到两个重复样本间一致性的peaks? 一种简单粗暴的方法就是用bedtools计算peaks的overlaps。
### bedtools install
conda install -c bioconda bedtools
usage:bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>
# '-a': 参数后加重复样本1(A)
# '-b': 参数后加重复样本2(B),也可以加多个样本
#其他常用参数解释和图解如下:
# '-wo':Write the original A and B entries plus the number of base pairs of overlap between the two features.
# '-wa':Write the original entry in A for each overlap.
# '-v':Only report those entries in A that have no overlaps with B
# 如果只有-a和-b参数,返回的是相对于A的overlaps。
# 加上参数-wo返回A和B原始的记录加上overlap的记录。
# 参数-wa返回每个overlap中A的原始记录。
###使用代码示例
bedtools intersect \
-a macs3/Nanog-rep1_peaks.narrowPeak \
-b macs3/Nanog-rep2_peaks.narrowPeak \
-wo > bedtools/Nanog-overlaps.bed
5.2 利用IDR寻找重复间共有peaks
The IDR (Irreproducible Discovery Rate) framework is a unified approach to measure the reproducibility of findings identified from replicate experiments and provide highly stable thresholds based on reproducibility. Unlike the usual scalar measures of reproducibility, the IDR approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. In layman's terms, the IDR method compares a pair of ranked lists of identifications (such as ChIP-seq peaks). These ranked lists should not be pre-thresholded i.e. they should provide identifications across the entire spectrum of high confidence/enrichment (signal) and low confidence/enrichment (noise). The IDR method then fits the bivariate rank distributions over the replicates in order to separate signal from noise based on a defined confidence of rank consistency and reproducibility of identifications i.e the IDR threshold.
参考使用方法
评估重复样本间peaks一致性的另一种方法是IDR。IDR是通过比较一对经过排序的regions/peaks 的列表,然后计算反映其重复性的值。IDR在ENCODE和modENCODE项目中被广泛使用,也是ChIP-seq指南和标准中的一部分。
IDR的 优点:
- 避免了初始阈值的选择,解决了不同callers的不可比较性
- IDR不依赖于阈值的选择,所有regions/peaks都被考虑在内。
- 它是依赖regions/peaks的排序,不要求对输入信号进行校准或标准化
IDR使用注意事项:
- 建议使用IDR时,MACS2 call peaks的步骤参数设置不要过于严格,以便鉴定出更多的peaks。
- 使用IDR需要先对MACS2的结果文件narrowPeak根据-log10(p-value)进行排序。
###install IDR
conda install -c bioconda idr
###Sort peak by -log10(p-value)
sort -k8,8nr Sample_peaks.narrowPeak > sample_peaks.sorted.narrowPeak
### usage(参数非常非常多,大部分默认就好)
usage: idr [-h] --samples SAMPLES SAMPLES [--peak-list PEAK_LIST] [--input-file-type {narrowPeak,broadPeak,bed,gff}] [--rank RANK] [--output-file OUTPUT_FILE] [--output-file-type {narrowPeak,broadPeak,bed}] [--log-output-file LOG_OUTPUT_FILE]
[--idr-threshold IDR_THRESHOLD] [--soft-idr-threshold SOFT_IDR_THRESHOLD] [--use-old-output-format] [--plot] [--use-nonoverlapping-peaks] [--peak-merge-method {sum,avg,min,max}] [--initial-mu INITIAL_MU] [--initial-sigma INITIAL_SIGMA]
[--initial-rho INITIAL_RHO] [--initial-mix-param INITIAL_MIX_PARAM] [--fix-mu] [--fix-sigma] [--dont-filter-peaks-below-noise-mean] [--use-best-multisummit-IDR] [--allow-negative-scores] [--random-seed RANDOM_SEED] [--max-iter MAX_ITER]
[--convergence-eps CONVERGENCE_EPS] [--only-merge-peaks] [--verbose] [--quiet] [--version]
###使用示例
idr --samples sample_Rep1.sorted.narrowPeak sample_Rep2.sorted.narrowPeak \
--input-file-type narrowPeak \
--rank signal.value \
--output-file sample-idr \
--plot \
--log-output-file sample.idr.log
###parameters
# --samples:narrowPeak的输入文件(重复样本,貌似只能分析两个重复间的共有peaks)
# --input-file-type:输入文件格式包括narrowPeak,broadPeak,bed
# --rank RANK Which column to use to rank peaks.
# Options: signal.value p.value q.value columnIndex
# Defaults:
# narrowPeak/broadPeak: signal.value
# bed: score
# --output-file: 输出文件路径
# --plot:输出IDR度量值的结果
# --log-output-file 记录共有peaks比例
结果文件解读
###输出文件包括:
sample-idr
sample-idr.log
sample-idr.png
##### (1)sample-idr
# sample-idr是common peaks的结果输出文件,格式与输入文件格式类似,只是多了几列信息。
# 前10列是标准的narrowPeak格式文件,包含重复样本整合后的peaks信息。
# col1: chrom string
# col2: chromStart int
# col3: chromEnd int
# col4: name string
# col5: score int 包含缩放的 IDR 值,
# 如min(int(log2(-125IDR), 1000),那么IDR=0,缩放的IDR就是1000;
# IDR=0.05, int(-125log2(0.05)) = 540;
# IDR=1.0, int(-125log2(1.0) = 0。
# col6: strand [+-.] Use '.' if no strand is assigned.
# col7: signalValue float
# col8: p-value float
# col9: q-value float
# col10: summit int
# col11: local IDR值 float -log10(Local IDR value)
# col12: global IDR值 float -log10(Global IDR value)
# global IDR值是第5列中用于计算缩放的IDR值,类似于对p值进行多个假设校正以计算FDR;
# local IDR类似于属于不可重复噪声部分的峰值的后验概率。
# 第13-16列:是重复样本1相关的信息,
# col13: rep1_chromStart int
# col14: rep1_chromEnd int
# col15: rep1_signalValue float
# col16: rep1_summit int
# 第17-20列:是重复样本2相关的信息,
# col17: rep1_chromStart int
# col18: rep1_chromEnd int
# col195: rep1_signalValue float
# col20: rep1_summit int
### 计算下common peaks的个数,接着可再计算下与总peaks的比率。
wc -l *-idr
### 如果想看IDR<0.05的,可以通过第5列信息过滤:
awk '{if($5 >= 540) print $0}' sample-idr | wc -l
#####(2)sample-idr.log
#log文件会给出peaks通过IDR < 0.05的比率,如下所示
# Initial parameter values: [0.10 1.00 0.20 0.50]
# Final parameter values: [2.37 1.70 0.97 0.76]
# Number of reported peaks - 14048/14048 (100.0%)
# Number of peaks passing IDR cutoff of 0.05 - 9874/14048 (70.3%)
#####(3)sample-idr.png
# png文件包括4个图,如下图所示:
# 左上: Rep1 peak ranks vs Rep2 peak ranks, 没有通过特定IDR阈值的peaks显示为红色。
# 右上:Rep1 log10 peak scores vs Rep2 log10 peak scores,没有通过特定IDR阈值的peaks显示为红色。
# 下面两个图: Peak rank vs IDR scores,箱线图展示了IDR值的分布,默认情况下,IDR值的阈值为-1E-6。
6 数据可视化
这一部分主要是将bam文件通过一定的标准转换为bigwig文件(.bw),可以用IGV等软件进行可视化,同时获得其他一些数据。
6.1 使用deepTools进行可视化
这一步骤主要用到deepTools。
deeptools提供bamCoverage和bamCompare进行格式转换,为了能够比较不同的样本,需要对先将基因组分成等宽分箱(bin),设置的binsize越小,分辨率(resolution)越高。统计每个分箱的read数,最后得到描述性统计值。对于两个样本,描述性统计值可以是两个样本的比率,或是比率的log2值,或者是差值。如果是单个样本,可以用SES方法进行标准化
### install deepTools
conda install -c bioconda deeptools
###使用deepTools中bamCoverage将sort后的bam文件转换为bigwig文件
bamCoverage -bs 10 --normalizeUsing RPGC --effectiveGenomeSize 2689660000 --extendReads -b sample.bam -o sample.bw
##parameters
# -e/--extendReads 拓展原来的read长度
# -bs binsize,越小分辨率越高
# --normalizeUsing 标准化方法,见后文
# --effectiveGenomeSize 提供基因组大小,我这里提供的是mm39
# -b BAM file to process
# -o Output file name.
# -of outFileFormat,Possible choices: bigwig, bedgraph
# --filterRNAstrand {forward, reverse}: 仅统计指定正链或负链
# --region/-r CHR:START:END: 选取某个区域统计
# --smoothLength: 通过使用分箱附近的read对分箱进行平滑化
# 如果为了其他结果进行比较,还需要进行标准化,deeptools提供了如下参数:
# --scaleFactor: 缩放系数
# `--normalizeUsingRPKMReads``: Per Kilobase per Million mapped reads (RPKM)标准化
# --normalizeTo1x: 按照1x测序深度(reads per genome coverage, RPGC)进行标准化
# --ignoreForNormalization: 指定那些染色体不需要经过标准化
##常用参数解释
# -bs binSize, Size of the bins, in bases, for the output of the bigwig/bedgraph file. (Default: 50)
# -r Region of the genome to limit the operation to - this is useful when testing parameters to reduce the computing time.
# The format is chr:start:end, for example –region chr10 or –region chr10:456700:891000.
# -bl blackListFileName, A BED or GTF file containing regions that should be excluded from all analyses.
# -p NumberOfProcessors
# --effectiveGenomeSize The effective genome size is the portion of the genome that is mappable.
# --normalizeUsing Possible choices: RPKM, CPM, BPM, RPGC, None
# Use one of the entered methods to normalize the number of reads per bin. By default, no normalization is performed.
# RPKM = Reads Per Kilobase per Million mapped reads;
# CPM = Counts Per Million mapped reads, same as CPM in RNA-seq;
# BPM = Bins Per Million mapped reads, same as TPM in RNA-seq;
# RPGC = reads per genomic content (1x normalization);
# Mapped reads are considered after blacklist filtering (if applied).
# RPKM (per bin) = number of reads per bin / (number of mapped reads (in millions) * bin length (kb)).
# CPM (per bin) = number of reads per bin / number of mapped reads (in millions).
# BPM (per bin) = number of reads per bin / sum of all reads per bin (in millions).
# RPGC (per bin) = number of reads per bin / scaling factor for 1x average coverage.
# None = the default and equivalent to not setting this option at all.
# This scaling factor, in turn, is determined from the sequencing depth: (total number of mapped reads * fragment length) / effective genome size.
# The scaling factor used is the inverse of the sequencing depth computed for the sample to match the 1x coverage.
# This option requires –effectiveGenomeSize.
# Each read is considered independently, if you want to only count one mate from a pair in paired-end data,
# then use the –samFlagInclude/–samFlagExclude options. (Default: None)
# -ignore --ignoreForNormalization, A list of space-delimited chromosome names containing those
# chromosomes that should be excluded for computing the normalization.
# --smoothLength The smooth length defines a window, larger than the binSize, to average the number of reads.
# -e --extendReads,This parameter allows the extension of reads to fragment size.
# chipseq示例
bamCoverage --bam a.bam -o a.SeqDepthNorm.bw --binSize 10 --normalizeUsing RPGC \
--effectiveGenomeSize 2150570000 --ignoreForNormalization chrX --extendReads
# This is an example for ChIP-seq data using additional options (smaller bin size for higher resolution,
# normalizing coverage to 1x mouse genome size,
# excluding chromosome X during the normalization step, and extending reads):
6.2 使用expander可视化CHIP-seq的peaks结果
EXPANDER (EXpression Analyzer and DisplayER) is a java-based tool for analysis of gene expression and NGS data. It seamlessly integrates in one package all analysis steps, including:
Expander 8.0 user manual in PDF format
Expander 8.0 user manual in HTML format
- Data preprocessing and normalization
- " Identification of differential genes (including methods suitable for NGS data analysis)
- Clustering and biclustering;
- Down-stream enrichment analyses of:
- GO functional categories
- TF binding sites in promoter regions
- MicroRNA sites in 3'-UTRs and
- Biological pathways and
- Chromosomal locations
- Network-based analysis of the expression data.
EXPANDER provides support for analysis of human, mouse, rat, chicken, fly, zebrafish, C. elegans, S. cerevisiae, S. pombe, arabidopsis, tomato, rice, listeria, A. fumigatus and E. coli.
### download expander
### 需要注意的是必须要人机验证才能下载,然后验证码一直没有出现,我以为是浏览器的问题,
### 先后换了google chrome,firefox,edge,internet explorer,360急速浏览器都不行,最后VPN穿墙才出现的验证码,极其不友好,吐槽一下,我不知道是不是只有我这样
http://acgt.cs.tau.ac.il/expander/downloads.html
### install expander
服务器上java太麻烦,后边再尝试;windows 可能方便一些~~~
7 peaks 注释
7.1 利用deeptools分析TSS附近信号强度
7.1.1 制作基因组TSS注释bed文件
要分析转录起始位点(TSS, Transscription start site)信号强度,需要从注释文件中将TSS位置信息提取出来。
### 下载参考基因组注释信息,根据图7.1.1 设置
https://genome.ucsc.edu/cgi-bin/hgTables
###下载的文件名为mm39.RefSeq.bed.gz,解压后获得mm39.RefSeq.bed
gunzip mm39.RefSeq.bed.gz
### head mm39.RefSeq.bed
##bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
#1 NM_001159711 chr1 + 16758679 16779829 16758731 16779761 5 16758679,16761907,16771072,16776396,16779662, 16758843,16761997,16771147,16776449,16779829, 0 Ly96 cmpl cmpl 0,1,1,1,0,
#1 NM_016923 chr1 + 16758679 16779829 16758731 16779761 5 16758679,16761907,16771018,16776396,16779662, 16758843,16761997,16771147,16776449,16779829, 0 Ly96 cmpl cmpl 0,1,1,1,0,
#1 NM_175642 chr1 - 25106556 25868788 25107248 25865841 31 25106556,25113765,25123324,25132845,25137956,25140513,25150772,25151257,25156369,25165721,25167899,25170331,25260840,25265800,25267501,25435516,25459638,25470503,25471507,25496072,25499823,25527005,25543513,25543993,25570950,25571555,25586496,25592755,25598835,25865084,25868542, #25107437,25113870,25123360,25133489,25137996,25140609,25150944,25151356,25156455,25165788,25167969,25170482,25260943,25265904,25267628,25435560,25459741,25470579,25471657,25496181,25499892,25527200,25543620,25544095,25571115,25571720,25586661,25592917,25598946,25865856,25868788, 0 Adgrb3 cmpl cmpl 0,0,0,1,0,0,2,2,0,2,1,0,2,0,2,0,2,1,1,0,0,0,1,1,1,1,1,1,1,0,-1,
#1 NM_008922 chr1 - 33492888 33708875 33493166 33708092 14 33492888,33503133,33519538,33523762,33551096,33553225,33572376,33667475,33669418,33701326,33702857,33706125,33707938,33708704, 33493385,33503202,33519621,33523889,33551282,33553298,33572444,33667613,33669514,33701447,33702937,33706229,33708101,33708875, 0 Prim2 cmpl cmpl 0,0,1,0,0,2,0,0,0,2,0,1,0,-1,
#1 NM_001347056 chr1 - 58697279 58731604 58697466 58731572 14 58697279,58698154,58698381,58700558,58705985,58706887,58708586,58709528,58715687,58717444,58719022,58730071,58730863,58731459, 58697593,58698264,58698471,58700621,58706089,58706987,58708639,58709626,58715749,58717538,58719156,58730117,58730941,58731604, 0 Flacc1 cmpl cmpl 2,0,0,0,1,0,1,2,0,2,0,2,2,0,
#1 NM_175370 chr1 - 58697279 58735148 58697582 58731572 15 58697279,58698154,58698381,58700558,58705985,58706887,58708586,58709528,58715687,58717444,58719022,58730071,58730863,58731459,58735060, 58697589,58698264,58698471,58700621,58706089,58706987,58708639,58709626,58715749,58717538,58719156,58730117,58730941,58731604,58735148, 0 Flacc1 cmpl cmpl 2,0,0,0,1,0,1,2,0,2,0,2,2,0,-1,
#2 NM_001311090 chr1 - 125795358 125839124 125801072 125838161 3 125795358,125838127,125839039, 125801308,125838265,125839124, 0 Lypd1 cmpl cmpl 1,0,-1,
#2 NM_145100 chr1 - 125795358 125839951 125801072 125839627 3 125795358,125838127,125839575, 125801308,125838265,125839951, 0 Lypd1 cmpl cmpl 1,1,0,
#2 NM_001311089 chr1 - 125795358 125840665 125801072 125838161 3 125795358,125838127,125840610, 125801308,125838265,125840665, 0 Lypd1 cmpl cmpl 1,0,-1,
###提取TSS上下游各2.5K区域位置信息存储到mm39.refseq.tss.bed
perl -alne '{next if /^#/;if($F[3] eq "+"){$start=$F[4]-2500;$end=$F[4]+2500}else{$start=$F[5]-2500;$end=$F[5]+2500}print join("\t",$F[2],$start,$end,$F[12],0,$F[3])}' mm39.RefSeq.bed |sort -u >mm39.refseq.tss.bed
########重要#################
###得到的信息如下,注意有些基因有多个TSS,这里是TSS上下游2.5Kb的信息,一定要注意是不是自己的需求
#如果后边想看TSS开始的,建议脚本修改start为$F[4],不加也不减任何数值;因为bed默认start为起始位点;
#如果提取的为TSS上下游各2.5bp信息,建议后边使用center做matrix
# head mm39.refseq.tss.bed
#chr10 100169037 100174037 Gm4301 0 +
#chr10 100169037 100174037 Gm4312 0 +
#chr10 100174197 100179197 Gm4302 0 +
#chr10 100174199 100179199 Gm4303 0 +
#chr10 100174199 100179199 Gm4305 0 +
#chr10 100174199 100179199 Gm4307 0 +
#chr10 100179331 100184331 Gm4302 0 +
#chr10 100179333 100184333 Gm4303 0 +
#chr10 100179333 100184333 Gm4305 0 +
#chr10 100179333 100184333 Gm4307 0 +
###注意以上脚本存储的是TSS 上下游2.5Kb的信息,如果只取TSS点,可以用如下脚本
awk -F "\t" 'NR!=1{print $3"\t"$5"\t"$5"\t"$13"\t"0"\t"$4}' mm39.RefSeq.bed >ucsc.rmm39efseq.tss.bed
###删除重复行
我自己写的perl脚本,用hash去除;不会的这一步可以用excel去重复~~
7.1.2 利用deeptools中的computeMatrix获得样本TSS位置score矩阵
computeMatrix is a tool that calculates scores per genome regions and prepares an intermediate file that can be used with plotHeatmap and plotProfiles. Typically, the genome regions are genes, but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple regions files (BED format). This tool can also be used to filter and sort regions according to their score.
Reference-point refers to a position within a BED region (e.g., the starting point). In this mode, only those genomicpositions before (upstream) and/or after (downstream) of the reference point will be plotted.
### 利用computeMatrix获得样本TSS位置score矩阵
computeMatrix reference-point \
--referencePoint TSS -p 5 -b 2500 -a 2500 \
-R mm39.refseq.tss.bed -S sample.bw \
--skipZeros -o matrix1.sample.TSS.gz --outFileSortedRegions regions.sample.bed
### 其中matrix1.sample.TSS.gz就是用于后续作图的数据
### 利用plotHeatmap作图,如下文图
plotHeatmap -m matrix1.sample.TSS.gz -out sample.Heatmap.png
###computeMatrix reference-point parameters
## --referencePoint
# Possible choices: TSS, TES, center;(Default: “TSS”)
## --beforeRegionStartLength, -b, --upstream
# Distance upstream of the reference-point selected. (Default: 500)
## --afterRegionStartLength, -a, --downstream
# Distance downstream of the reference-point selected. (Default: 1500)
## --regionsFileName, -R
# File name or names, in BED or GTF format, containing the regions to plot.
## --scoreFileName, -S
# bigWig file(s) containing the scores to be plotted.
## --skipZeros
# Whether regions with only scores of zero should be included or not. Default is to include them.
## --outFileName, -out, -o
# File name to save the gzipped matrix file needed by the “plotHeatmap” and “plotProfile” tools.
## --outFileNameMatrix
# If this option is given, then the matrix of values underlying the heatmap will be saved using the indicated name,
# e.g. IndividualValues.tab.This matrix can easily be loaded into R or other programs.
## --outFileSortedRegions
# File name in which the regions are saved after skiping zeros or min/max threshold values.
# The order of the regions in the file follows the sorting order selected.
# This is useful, for example, to generate other heatmaps keeping the sorting of the first heatmap.
# Example: Heatmap1sortedRegions.bed
## --numberOfProcessors, -p
# Number of processors to use. (Default: 1)
### other options
## --nanAfterEnd
# If set, any values after the region end are discarded.
# This is useful to visualize the region end when not using the scale-regions mode and when the reference-point is set to the TSS.
## --binSize, -bs
# Length, in bases, of the non-overlapping bins for averaging the score over the regions length. (Default: 10)
## --sortRegions
# Possible choices: descend, ascend, no, keep
## --sortUsing
# Possible choices: mean, median, max, min, sum, region_length (Default: “mean”)
## --missingDataAsZero
# If set, missing data (NAs) will be treated as zeros.
## --minThreshold
# Numeric value. Any region containing a value that is less than or equal to this will be skipped.
# This is useful to skip, for example, genes where the read count is zero for any of the bins.
# This could be the result of unmappable areas and can bias the overall results. (Default: None)
## --maxThreshold
# Numeric value. Any region containing a value greater than or equal to this will be skipped.
# The maxThreshold is useful to skip those few regions with very high read counts (e.g. micro satellites) that may bias the average values. (Default: None)
## --blackListFileName, -bl
# A BED file containing regions that should be excluded from all analyses.
# Currently this works by rejecting genomic chunks that happen to overlap an entry.
# Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered.
## --samplesLabel
# Labels for the samples. This will then be passed to plotHeatmap and plotProfile. The default is to use the file name of the sample.
# The sample labels should be separated by spaces and quoted if a label itselfcontains a space E.g. –samplesLabel label-1 “label 2”
## --smartLabels
# Instead of manually specifying labels for the input bigWig and BED/GTF files, this causes deepTools to use the file name after removing the path and extension.
## --quiet, -q
# Set to remove any warning or processing messages.
## --scale
# If set, all values are multiplied by this number. (Default: 1)
7.2 利用deeptools中的multiBamSummary和 plotCorrelation计算bamfile的相关性
既可以根据read account来做相关性分析,也可以根据bigwig文件中的score来做
### 为bam文件构建index
samtools index sample.bam
###usage multiBamSummary
multiBamSummary bins --bamfiles file1.bam file2.bam -o results.npz
###示例
multiBamSummary bins --bamfiles ../2_alignment/*.sort.bam --minMappingQuality 30 -p 5 -out readCounts.npz --outRawCounts readCounts.tab
###parameters
## --bamfiles, -b
# List of indexed bam files separated by spaces.
## --outFileName, -out, -o
# File name to save the coverage matrix. This matrix can be subsequently plotted using plotCorrelation or or plotPCA.
###usage plotCorrelation
usage: plotCorrelation [-h] --corData FILE --corMethod {spearman,pearson}
--whatToPlot {heatmap,scatterplot} [--plotFile FILE]
[--skipZeros]
[--labels sample1 sample2 [sample1 sample2 ...]]
[--plotTitle PLOTTITLE] [--plotFileFormat FILETYPE]
[--removeOutliers] [--version]
[--outFileCorMatrix FILE] [--plotHeight PLOTHEIGHT]
[--plotWidth PLOTWIDTH] [--zMin ZMIN] [--zMax ZMAX]
[--colorMap] [--plotNumbers] [--xRange XRANGE XRANGE]
[--yRange YRANGE YRANGE] [--log1p]
###示例heatmap
plotCorrelation --corData readCounts.npz --corMethod pearson --skipZeros --whatToPlot heatmap \
--plotTitle "Heatmap-Pearson Correlation of Average Scores" \
-o heatmap_PearsonCorr_bamScores.png \
--colorMap RdYlBu --plotNumbers \
--outFileCorMatrix PearsonCorr_bamScores.tab \
--removeOutliers
###示例scatterplot
plotCorrelation --corData readCounts.npz --corMethod pearson --skipZeros --whatToPlot scatterplot \
--plotTitle "Scatterplot-Pearson Correlation of Average Scores" \
-o scatterplot_PearsonCorr_bamScores.png \
--outFileCorMatrix scatPearsonCorr_bamScores.tab
###parameters
## --corData, -in
# Compressed matrix of values generated by multiBigwigSummary or multiBamSummary
## --corMethod, -c
# Possible choices: spearman, pearson. Correlation method.
## --whatToPlot, -p
# Possible choices: heatmap, scatterplot. Choose between a heatmap or pairwise scatter plots
7.3 网页版peaks注释
主要用到GREAT网页
GREAT predicts functions of cis-regulatory regions.
我们可以看到,该网站可以做Peaks注释分析,输入文件是bed格式(chrN:XXXX-XXXX)物种主要做人和小鼠。
人的参考基因组支持到hg19和hg38;而小鼠基因组仅支持到mm9和mm10;
我现在比对用得到的是mm39;所以在上传bed文件时,我必须做不同基因组位置信息的转换,将mm39的位置信息转换为mm10。这一步主要用到UCSC的Lift Genome Annotations。如果你的参考基因组没问题,可以不用转换。
转换用到的bed文件,我们可以从3.5中NAME_peaks.narrowPeak获取前三列信息
### 从narrowPeak中获得peaks基因组位置bed信息
awk -F "\t" '{print $1"\t"$2"\t"$3}' NAME_peaks.narrowPeak >NAME.mm39.peaks.bed
得到的NAME.peaks.bed可以上传到网页版liftover,手动转换
也可以在服务器上下载liftover脚本,通过命令行转换。转换的chain可以从参考基因组下载位置去下载
### liftOver下载
wget (http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
###权限更改
chmod u+x lifterOver
###Usage: Move annotations from one assembly to another
liftOver oldFile map.chain newFile unMapped
# oldFile and newFile are in bed format by default, but can be in GFF and maybe eventually others with the appropriate flags below.
# The map.chain file has the old genome as the target and the new genome as the query.
###chain下载,我的是从mm39转换为mm10
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm39/liftOver/mm39ToMm10.over.chain.gz
### 更改权限并解压
chmod 777 mm39ToMm10.over.chain.gz
gunzip mm39ToMm10.over.chain.gz
###转换
liftOver NAME.mm39.peaks.bed mm39ToMm10.over.chain.gz NAME.mm10.peaks.bed unmapped
将最后获得的NAME.mm10.peaks.bed输入到GREAT网页 里边去就可以了。
7.4 利用R包ChIPseeker注释peaks
ChIPseeker: ChIP peak Annotation, Comparison, and Visualization
This package implements functions to retrieve the nearest genes around the peak, annotate genomic region of the peak, statstical methods for estimate the significance of overlap among ChIP peak data sets, and incorporate GEO database for user to compare their own dataset with those deposited in database. The comparison can be used to infer cooperative regulation and thus can be used to generate hypotheses. Several visualization functions are implemented to summarize the coverage of the peak experiment, average profile and heatmap of peaks binding to TSS regions, genomic annotation, distance to TSS, and overlap of peaks or genes.
这是典型大佬一言不合就自己写R包(我用起来不满意给你提意见,但是你拒绝采纳,那我自己写吧~~)
ChIPpeakAnno WAS the only R package for ChIP peak annotation. I used it for annotating peak in my recent study. I found it does not consider the strand information of genes. I reported the bug to the authors, but they are reluctant to change. So I decided to develop my own package, ChIPseeker, and it’s now available in Bioconductor. -------------by Guangchuang Yu
###通过conda安装r-magick,后续有包个ggimage依赖这个包
conda install -c conda-forge r-magick
### 安装R包
library(BiocManager)
BiocManager::install("ChIPseeker")
BiocManager::install("TxDb.Mmusculus.UCSC.mm39.refGene") ##根据自己需求下载参考基因组信息
BiocManager::install("ggplot2")
BiocManager::install("ggupset")
BiocManager::install("ggplotify")
BiocManager::install("ggimage")
###加载R包
library(ChIPseeker)
library(TxDb.Mmusculus.UCSC.mm39.refGene)
library(ggplot2)
library(ggupset)
library(ggplotify)
library(ggimage)
###读取数据
peak.gr <- readPeakFile("~/3_macs/sample_peaks.narrowPeak", as="GRanges")
#使用的是3.5中通过macs3得到的NAME_peaks.narrowPeak文件,as是指读取后转化成的对象
###加载基因组注释文件
txdb <-TxDb.Mmusculus.UCSC.mm39.refGene
###peaks在基因组上的分布
pdf("Figure1.sample_distribution of peaks in chrs.pdf")
covplot(peak.gr, weightCol="V5")
#covplot(peak.gr, weightCol="V5", chrs=c("chr17", "chr18"), xlim=c(4.5e7, 5e7)) #chr17 chr18上的peaks
###多个文件同时画图
pdf("多个文件同时画染色体分布图")
peak=GenomicRanges::GRangesList(CBX6=readPeakFile("~/3_macs/sample1_peaks.narrowPeak", as="GRanges"),CBX7=readPeakFile("~/3_macs/sample2_peaks.narrowPeak", as="GRanges"))
covplot(peak, weightCol="V5") + facet_grid(chr ~ .id)
dev.off()
###TSS区域,上下游2.5kb,可自行设置
pdf("Figure2.sample_TSS_peaks_heatmap and plot.pdf")
##热图
tss <- getPromoters(TxDb=txdb, upstream=2500, downstream=2500)
tagMatrix <- getTagMatrix(peak.gr, windows=tss)
#tagHeatmap(tagMatrix, xlim=c(-2500, 2500), color="red")
peakHeatmap(peak.gr, TxDb=txdb, upstream=2500, downstream=2500, color="red") #等效语句
##峰图
plotAvgProf(tagMatrix, xlim=c(-2500, 2500),
xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
#含置信区间峰图
plotAvgProf(tagMatrix, xlim=c(-2500, 2500), conf = 0.95, resample = 1000)
dev.off()
###peaks注释,tss选取上下游2.5Kb,可自行调整,这里默认promoter是上游2.5Kb
peakAnno <- annotatePeak(peak.gr, tssRegion=c(-2500, 2500), TxDb=txdb, annoDb="org.Mm.eg.db")
peakAnno
#在注释时,有的peak可能同时落在两个或者更多的gene feature里(例如是一个基因的外显子而同时又是另一个基因的内含子),但只能注释其中一个。
#默认按照Promoter、5’ UTR、3’ UTR、Exon、Intron、Downstream、Intergenic顺序先后注释。
#
### 保存注释结果
peakAnno.df <- as.data.frame(peakAnno) #保存为数据框
peakAnno.gr <- as.GRanges(peakAnno) #保存为GenomicRanges
write.table(peakAnno.df,file="sample.annotation.xls",quote=F,sep="\t")
###peaks注释可视化(多种图形及最近基因)
pdf("Figure 3.sample_peaks annotation_pie_bar_venn_upset.pdf")
plotAnnoPie(peakAnno)
plotAnnoBar(peakAnno)
vennpie(peakAnno)
upsetplot(peakAnno, vennpie=TRUE)
dev.off()
8 Motif查找
motif是比较有特征的短序列,会多次出现的,一般认为它的生物学意义重大,做完CHIP-seq分析之后,一般都会寻找motif 。
查找有两种:
一种是de novo的,要求的输入文件的fasta序列,一般是根据peak的区域的坐标提取好序列;
另一种是依赖于数据库的搜寻匹配,将现有的ChIP-seq数据进行整合,提供更全面,更准确的motif数据库。
目前最常用的工具是meme
目前版本:Release of MEME Suite 5.3.3
### 软件下载及安装,根据自己的需求选择安装位置
wget https://meme-suite.org/meme/meme-software/5.3.3/meme-5.3.3.tar.gz
tar -zxvf meme-5.3.3.tar.gz
cd meme-5.3.3/
./configure --prefix=$HOME/bin/meme --with-url="https://meme-suite.org/meme"
make
make test
make install
###记得将安装好的meme加入PATH,便于调用
### 数据库下载,根据自己的需求选择存储位置
## Motif Databases (updated 14 Dec 2020)
wget https://meme-suite.org/meme/meme-software/Databases/motifs/motif_databases.12.21.tgz
## GOMo Databases (updated 17 Feb 2015)
wget https://meme-suite.org/meme/meme-software/Databases/gomo/gomo_databases.3.2.tgz
## T-Gene Databases (updated 12 Oct 2019)
wget https://meme-suite.org/meme/meme-software/Databases/tgene/tgene_databases.1.0.tgz
### 根据bed文件提取fasta序列(其实有bed文件,有参考基因组,几行perl脚本就搞定了,但要普适性,所以看下边吧)
##利用bedtools提取fasta文件
usage:bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf> -fo <fasta>
##Options:
# -fi Input FASTA file
# -bed BED/GFF/VCF file of ranges to extract from -fi
# -fo Output file (can be FASTA or TAB-delimited)
# -name Use the name field for the FASTA header
# -split given BED12 fmt., extract and concatenate the sequencesfrom the BED "blocks" (e.g., exons)
# -tab Write output in TAB delimited format.
# - Default is FASTA format.
# -s Force strandedness. If the feature occupies the antisense,
# strand, the sequence will be reverse complemented.
# - By default, strand information is ignored.
# -fullHeader Use full fasta header.
# - By default, only the word before the first space or tab is used.
## 从7.3 部分获取bed文件前三列,我再把代码贴一遍
## 从narrowPeak中获得peaks基因组位置bed信息
awk -F "\t" '{print $1"\t"$2"\t"$3}' NAME_peaks.narrowPeak >NAME.mm39.peaks.bed
##fasta示例
bedtools getfasta -fi mm39.fa -bed NAME.mm39.peaks.bed -name -fo NAME.fa
### 获得的文件可以上传meme网站,如下图,可以在线获得结果
###可通过代码运行
meme D0ac10_1.fa -dna -oc ./ -nostatus -time 14400 -mod zoops -nmotifs 3 -minw 6 -maxw 50 -objfun classic -revcomp -markov_order 0
# Usage: meme <dataset> [optional arguments]
# <dataset> file containing sequences in FASTA format
# [-h] print this message
# [-o <output dir>] name of directory for output files will not replace existing directory
# [-oc <output dir>] name of directory for output files will replace existing directory
# [-text] output in text format (default is HTML)
# [-objfun classic|de|se|cd|ce] objective function (default: classic)
# [-test mhg|mbn|mrs] statistical test type (default: mhg)
# [-use_llr] use LLR in search for starts in Classic mode
# [-neg <negdataset>] file containing control sequences
# [-shuf <kmer>] preserve frequencies of k-mers of size <kmer> when shuffling (default: 2)
# [-hsfrac <hsfrac>] fraction of primary sequences in holdout set (default: 0.5)
# [-cefrac <cefrac>] fraction sequence length for CE region (default: 0.25)
# [-searchsize <ssize>] maximum portion of primary dataset to use for motif search (in characters)
# [-maxsize <maxsize>] maximum dataset size in characters
# [-norand] do not randomize the order of the input sequences with -searchsize
# [-csites <csites>] maximum number of sites for EM in Classic mode
# [-seed <seed>] random seed for shuffling and sampling
# [-dna] sequences use DNA alphabet
# [-rna] sequences use RNA alphabet
# [-protein] sequences use protein alphabet
# [-alph <alph file>] sequences use custom alphabet
# [-revcomp] allow sites on + or - DNA strands
# [-pal] force palindromes (requires -dna)
# [-mod oops|zoops|anr] distribution of motifs
# [-nmotifs <nmotifs>] maximum number of motifs to find
# [-evt <ev>] stop if motif E-value greater than <evt>
# [-time <t>] quit before <t> seconds have elapsed
# [-nsites <sites>] number of sites for each motif
# [-minsites <minsites>] minimum number of sites for each motif
# [-maxsites <maxsites>] maximum number of sites for each motif
# [-wnsites <wnsites>] weight on expected number of sites
# [-w <w>] motif width
# [-minw <minw>] minimum motif width
# [-maxw <maxw>] maximum motif width
# [-allw] test starts of all widths from minw to maxw
# [-nomatrim] do not adjust motif width using multiple alignment
# [-wg <wg>] gap opening cost for multiple alignments
# [-ws <ws>] gap extension cost for multiple alignments
# [-noendgaps] do not count end gaps in multiple alignments
# [-bfile <bfile>] name of background Markov model file
# [-markov_order <order>] (maximum) order of Markov model to use or create
# [-psp <pspfile>] name of positional priors file
# [-maxiter <maxiter>] maximum EM iterations to run
# [-distance <distance>] EM convergence criterion
# [-prior dirichlet|dmix|mega|megap|addone] type of prior to use
# [-b <b>] strength of the prior
# [-plib <plib>] name of Dirichlet prior file
# [-spfuzz <spfuzz>] fuzziness of sequence to theta mapping
# [-spmap uni|pam] starting point seq to theta mapping type
# [-cons <cons>] consensus sequence to start EM from
# [-brief <n>] omit sites and sequence tables in
# output if more than <n> primary sequences
# [-nostatus] do not print progress reports to terminal
# [-p <np>] use parallel version with <np> processors
# [-sf <sf>] print <sf> as name of sequence file
# [-V] verbose mode
# [-version] display the version number and exit