可以先使用intron做参考基因组比对再取。看一下结果
如果用取sam的方式,同一条序列只要看左右有没有超出intron位置即可。
这样可以使用索引,问题是,比对算法中的索引是怎么起作用的?查看一下.fai文件。
改造igv
问题集中在取序列上:
1.取intron内的
2.取配对的
是先写出各部分的再组装还是改原来取序列的代码
conda安装好软件后需要先conda activate 进环境后才能使用。即使计入了path 有的还是需要
gffread翻译
gffread v0.12.1. Usage:
gffread <input_gff> [-g <genomic_seqs_fasta> | <dir>][-s <seq_info.fsize>]
[-o <outfile>] [-t <trackname>] [-r [[<strand>]<chr>:]<start>..<end> [-R]]
[-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
[-i <maxintron>] [--stream] [--bed] [--table <attrlist>] [--sort-by <ref.lst>]
Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
transcripts (exon or CDS) and more.
过滤转化或者聚类GFF GTF BED记录,提取转录本?序列等【主要功能】
By default (i.e. without -O) only transcripts are processed, discarding any
other non-transcript features. Default output is a simplified GFF3 with only
the basic attributes.
默认(除非用-O)只处理转录本,丢弃非转录本信息,默认输出GFF3.简化版
<input_gff> is a GFF file, use '-' for stdin
输如的GFF文件,用-作为
Options:
-i discard transcripts having an intron larger than <maxintron> 丢弃有intron长大于输入参数的转录本
-l discard transcripts shorter than <minlen> bases 丢弃碱基数小于输入参数的转录本
-r only show transcripts overlapping coordinate range <start>..<end>
(on chromosome/contig <chr>, strand <strand> if provided) 只展示重叠坐标区的起止
-R for -r option, discard all transcripts that are not fully
contained within the given range 丢弃包括不完全在给定区域的转录本
-U discard single-exon transcripts
-C coding only: discard mRNAs that have no CDS features 丢弃没CDS特征的
--nc non-coding only: discard mRNAs that have CDS features
--ignore-locus : discard locus features and attributes found in the input
-A use the description field from <seq_info.fsize> and add it
as the value for a 'descr' attribute to the GFF record 使用来自 <seq_info.fsize>的描述字段添加到GFF文件
-s <seq_info.fsize> is a tab-delimited file providing this info
for each of the mapped sequences:
<seq-name> <seq-length> <seq-description>
(useful for -A option with mRNA/EST/protein mappings)
Sorting: (by default, chromosomes are kept in the order they were found) 默认不排序
--sort-alpha : chromosomes (reference sequences) are sorted alphabetically 按字母排序
--sort-by : sort the reference sequences by the order in which their
names are given in the <refseq.lst> file
Misc options:
-F preserve all GFF attributes (for non-exon features) 保留所有GFF属性
--keep-exon-attrs : for -F option, do not attempt to reduce redundant
exon/CDS attributes 不要试图减少冗余exon/CDS属性
-G do not keep exon attributes, move them to the transcript feature
(for GFF3 output) 移除exon属性
--keep-genes : in transcript-only mode (default), also preserve gene records
--keep-comments: for GFF3 input/output, try to preserve comments 保留注释
-O process other non-transcript GFF records (by default non-transcript
records are ignored)
-V discard any mRNAs with CDS having in-frame stop codons (requires -g)
-H for -V option, check and adjust the starting CDS phase
if the original phase leads to a translation with an
in-frame stop codon
-B for -V option, single-exon transcripts are also checked on the
opposite strand (requires -g)
-P add transcript level GFF attributes about the coding status of each
transcript, including partialness or in-frame stop codons (requires -g)
--add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
that have CDS features
--adj-stop stop codon adjustment: enables -P and performs automatic
adjustment of the CDS stop coordinate if premature or downstream
-N discard multi-exon mRNAs that have any intron with a non-canonical
splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
-J discard any mRNAs that either lack initial START codon
or the terminal STOP codon, or have an in-frame stop codon
(i.e. only print mRNAs with a complete CDS)
--no-pseudo: filter out records matching the 'pseudo' keyword
--in-bed: input should be parsed as BED format (automatic if the input
filename ends with .bed*)
--in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
features (see --tlf option below); automatic if the input
filename ends with .tlf)
--stream: fast processing of input GFF/BED transcripts as they are received
((no sorting, exons must be grouped by transcript in the input data)
Clustering:
-M/--merge : cluster the input transcripts into loci, discarding
"duplicated" transcripts (those with the same exact introns
and fully contained or equal boundaries)
-d <dupinfo> : for -M option, write duplication info to file <dupinfo>
--cluster-only: same as -M/--merge but without discarding any of the
"duplicate" transcripts, only create "locus" features
-K for -M option: also discard as redundant the shorter, fully contained
transcripts (intron chains matching a part of the container)
-Q for -M option, no longer require boundary containment when assessing
redundancy (can be combined with -K); only introns have to match for
multi-exon transcripts, and >=80% overlap for single-exon transcripts
-Y for -M option, enforce -Q but also discard overlapping single-exon
transcripts, even on the opposite strand (can be combined with -K)
Output options:
--force-exons: make sure that the lowest level GFF features are considered
"exon" features 确保最低的GFF特征被考虑
--gene2exon: for single-line genes not parenting any transcripts, add an
exon feature spanning the entire gene (treat it as a transcript)
--t-adopt: try to find a parent gene overlapping/containing a transcript
that does not have any explicit gene Parent
-D decode url encoded characters within attributes
-Z merge very close exons into a single exon (when intron size<4)将非常近的外显子合并为一个
-g full path to a multi-fasta file with the genomic sequences
for all input mappings, OR a directory with single-fasta files
(one per genomic sequence, with file names matching sequence names)
-w write a fasta file with spliced exons for each transcript 对每个转录本写一个fa文件,有剪切外显子
--w-add <N> for the -w option, extract additional <N> bases
both upstream and downstream of the transcript boundaries 提取附加的上下游N个碱基
-x write a fasta file with spliced CDS for each GFF transcript
-y write a protein fasta file with the translation of CDS for each record
-W for -w and -x options, write in the FASTA defline all the exon
coordinates projected onto the spliced sequence;
-S for -y option, use '*' instead of '.' as stop codon translation
-L Ensembl GTF to GFF3 conversion (implies -F; should be used with -m)
-m <chr_replace> is a name mapping table for converting reference
sequence names, having this 2-column format:
<original_ref_ID> <new_ref_ID>
-t use <trackname> in the 2nd column of each GFF/GTF output line
-o write the records into <outfile> instead of stdout
-T main output will be GTF instead of GFF3
--bed output records in BED format instead of default GFF3
--tlf output "transcript line format" which is like GFF
but exons, CDS features and related data are stored as GFF
attributes in the transcript feature line, like this:
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
<exons> is a comma-delimited list of exon_start-exon_end coordinates;
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
--table output a simple tab delimited format instead of GFF, with columns
having the values of GFF attributes given in <attrlist>; special
pseudo-attributes (prefixed by @) are recognized:
@id, @geneid, @chr, @start, @end, @strand, @numexons, @exons,
@cds, @covlen, @cdslen
If any of -w/-y/-x FASTA output files are enabled, the same fields
(excluding @id) are appended to the definition line of corresponding
FASTA records
-v,-E expose (warn about) duplicate transcript IDs and other potential
problems with the given GFF/GTF records
GFFREAD看来达不到目的,从bedtools上看到有getfast的subcommand可以达到目的
bedtools安装:
之前不在conda的base环境中装过,现在在base也能装。经验之一是:将conda的bin加入到path之后,bin下的软件都载入到path里了,这个经验非常重要。
参数:
Usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
intersect Find overlapping intervals in various ways.
window Find overlapping intervals within a window around an interval.
closest Find the closest, potentially non-overlapping interval.
coverage Compute the coverage over defined intervals.
map Apply a function to a column for each overlapping interval.
genomecov Compute the coverage over an entire genome.
merge Combine overlapping/nearby intervals into a single interval.
cluster Cluster (but don't merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
shift Adjust the position of intervals.
subtract Remove intervals based on overlaps b/w two files.
slop Adjust the size of intervals.
flank Create new intervals from the flanks of existing intervals.
sort Order the intervals in a file.
random Generate random intervals in a genome.
shuffle Randomly redistribute intervals in a genome.
sample Sample random records from file using reservoir sampling.
spacing Report the gap lengths between intervals in a file.
annotate Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
multiinter Identifies common intervals among multiple interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
[ Format conversion ]
bamtobed Convert BAM alignments to BED (& other) formats.
bedtobam Convert intervals to BAM records.
bamtofastq Convert BAM records to FASTQ records.
bedpetobam Convert BEDPE intervals to BAM records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
getfasta Use intervals to extract sequences from a FASTA file.
maskfasta Use intervals to mask sequences from a FASTA file.
nuc Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
multicov Counts coverage from multiple BAMs at specific intervals.
tag Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
reldist Calculate the distribution of relative distances b/w two files.
fisher Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
overlap Computes the amount of overlap from two intervals.
igv Create an IGV snapshot batch script.
links Create a HTML page of links to UCSC locations.
makewindows Make interval "windows" across a genome.
groupby Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
expand Replicate lines based on lists of values in columns.
split Split a file into multiple files with equal records or base pairs.
summary Statistical summary of intervals in a file.
[ General help ]
--help Print this help menu.
--version What version of bedtools are you using?.
--contact Feature requests, bugs, mailing lists, etc.
出错
malformed畸形的
entry 入口 条目
GFF3有问题,转bed试试
方法是:https://www.bbsmax.com/A/LPdo6Oy253/
第一步出错:
这个报错比较明显,回头检查GFF3文件
在反向序列这里,star确实比end大,说明其默认的star end是左右,只按一个顺序。
看一下原文件
即使-也是左小右大。
看来需要改一下GFF3,暂时先用以前的intron.fa
改一下顺序以后,流程应该是:GFF3 直接取intron,fa,用bedtools
先用以前的:
没索引
但即使跑出来,也只能做为验证,正式的步骤应该是sam按intron取,sam取互补的另一条
跑出来之后效果并不好
不但特别大,而且文件内:
还是不用intron做参考基因组比较好。
查找算法://www.greatytc.com/p/95d224c4d13e
常见查找算法:
[1. 顺序查找]
[2. 二分查找] [3. 插值查找] [4. 斐波那契查找]
[5. 树表查找]
[6. 分块查找]
[7. 哈希查找]
说是七种,其实二分查找、插值查找以及斐波那契查找都可以归为一类——插值查找。插值查找和斐波那契查找是在二分查找的基础上的优化查找算法。
之前想用二分查找,实现虽然不难,但是想实现以下更复杂的算法,因此想使用分块(索引)、哈希的查找方式。
分块查找
分块查找又称索引顺序查找,它是顺序查找的一种改进方法。
算法思想:将n个数据元素"按块有序"划分为m块(m ≤ n)。每一块中的结点不必有序,但块与块之间必须"按块有序";即第1块中任一元素的关键字都必须小于第2块中任一元素的关键字;而第2块中任一元素又都必须小于第3块中的任一元素,……
算法流程:
- step1 先选取各块中的最大关键字构成一个索引表;
- step2 查找分两个部分:先对索引表进行二分查找或顺序查找,以确定待查记录在哪一块中;然后,在已确定的块中用顺序法进行查找。
建索引表的思想有助于理解比对算法
哈希查找
哈希表是一个在时间和空间上做出权衡的经典例子。如果没有内存限制,那么可以直接将键作为数组的索引。那么所有的查找时间复杂度为O(1);如果没有时间限制,那么我们可以使用无序数组并进行顺序查找,这样只需要很少的内存。哈希表使用了适度的时间和空间来在这两个极端之间找到了平衡。只需要调整哈希函数算法即可在时间和空间上做出取舍。
基因组数据可视化工具
格式:
chr1 start end strand
只需要chrom chromstart chromend就够用。
一个问题是哈希只能找数字不能找范围,那么需要将范围转化为一个数字。用中值的办法也可行,但是有点麻烦。设计了一个树结构,想用改变后的树查找方式查找。
左右子树是intron 的范围,在第一个不大于处停下。在比较star和左子树大小即可。
但这样有点复杂,只是示意图,实际做需要对每个染色体单独建二叉树,再先找大的(先找小的也一样),然后将小的与同一节点下的左子节点比对。
根据需求,可以简化建树过程。根据有序数据创建二叉树,将有序数组放入二叉树。https://blog.csdn.net/weixin_42813521/article/details/107648974
先建立,再遍历。采用深度优先的遍历方式。
但是并不是顺序的,而是重叠的,这样建树有困难。感觉不应该重叠
重新组装。看了转录组的重新组装感觉第二部分要做的事和重头组装有一定相似性。
想统计重叠区域,检查sort过的bam文件:
确实是按照起始位点的顺序排序的,read也不再是一对一对的。也同igv上看到的----有起始位点相同的区域。
相同序列,相同起点和比对情况,不同测序。
选取最长的?
最上面是igv自己拼接的?
有点奇怪 将中部没有的也拼接上
?
5' 对齐
3‘有空隙
方向无关
intron问题:
1的同名位置不同 12的条带不同
合起来?
同一段位置上有很多,以哪个为标准?
重新组装问题
取在intron内的序列
1.对gff3,有去重和以什么为主的问题。一个位置当然要么只能是intron,要么不是。先grep出所有正链,做一个正链gff3文件。
命令:
grep + Arabidopsis_thaliana.TAIR10.47.intron.gff3 > Arabidopsis_thaliana.TAIR10.47.only+intron.gff3
结果是没sort过的,不是从小到大。
2.去重+以长的为intron获取位置
因此先不uniq,而是先sort,看一下重复和igv中是否一样多。且有的重复同名,有的不同名
命令:
sort Arabidopsis_thaliana.TAIR10.47.only+intron.gff3 -t " " -k 4 -n > TAIR10.47.only+intron_sort.gff3
要点:-t后面想输入tab需要ctrl+v 再在“”内按tab -n告知以数字排序,不然会出错。
结果:
parent不同,但intron位置相同,命名不同。名不同主要可能是因为parent不同。而确实在一个gene下的intron没有重复位置的。
问题应该是:不同gene下的intron有相同位置。
情况比较普遍。
提取start end chr,按chr,start,end格式。
命令:
cut -d " " -f 1,4,5 Arabidopsis_thaliana.TAIR10.47.only+intron.gff3 | uniq > TAIR10.47.only+intron_uniq.gff3
chr1 chr2一变就重新起头
有两个问题:
1.sam的chr变是否重排位置
2.0-base和1-base问题
3.uniq出的结果里,有没有start 一样end不一样的?
确实有。如果我只留最大的,那么反向排序再uniq
命令:
sort TAIR10.47.only+intron_uniq.gff3 -t " " -k 2 -n -r > TAIR10.47.only+intron_uniq_reverse.gff3
结果:
改一下,先cut和sort reverse
再按chr5 -> chr1 pt mt 分成七个文件,再对每个文件sort
不对 先cut出七个文件。
先egrep出7个文件,再对每个文件sort -n -k 2
因为是从左往右的,所以正序时,开头小结尾一样的应该被开头大的覆盖,而倒序可以将开头一样结尾小的去掉。
总之,需要uniq一下开头和结尾。都应该是倒序以去掉小的。
先对倒序的文件uniq,因为先出现的是end大的,所以同start但是end小的被去除。同时保持倒序不变,再对end一致的同样从操作,以去掉同endstart小的,因为文件已start排序,start大的倒序后在前。
首先正序验证一下,再同样逆序处理
对比证明其如实保留了大的star。
至此,取出了chr1的intron start 和end
步骤总结:
1.先cut 145 列 并根据12列反向排序
2.grep出chr1的
3.对反向文件,同start保留最大end
4.保持反向,对同end取最大start
由此取出范围内最大的intron起止位置。下一步放入树中,比对在intron内的序列。