用自建gtf构建的Txdb文件,你需要分成几步?
![](http://www.onekao.net/templets/default/images/content_ad.gif)
对于提取序列来说,我们通常会选择用、等,由于她们有特别建立的操作序列的功能,例如取两个bed文件的交集,提取某段序列等等。但实际上,她们对于区间的操作是不如R便捷的,例如取基因body,取第一个外显子,取启动子区域等等。所以,我们有时侯才会想能够在R上面,花式找了些区间以后,提取她们的序列,实际上,这也是可以的,须要分成几步。
#
首先我们须要先加载一些包
#
## BSgenome.Athaliana.TAIR.TAIR9是拟南芥的基因组序列包 # TAIR9和TAIR10是一样的,无非是TAIR10自身的基因注释丰富了些 library(BSgenome.Athaliana.TAIR.TAIR9) # TxDb.Athaliana.BioMart.plantsmart28是拟南芥的Txdb包 # 就是我们ChIPSeeker做peak注释的那个Txdb library(TxDb.Athaliana.BioMart.plantsmart28) Txdb_package <- TxDb.Athaliana.BioMart.plantsmart28 # 也可以用自己的gtf做 # 所以这些操作也可以针对非模式物种 Txdb_gtf <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf")
#
之后是提取基因的区间
#
> gene_package <- genes(Txdb_package)
> gene_package
GRanges object with 33602 ranges and 1 metadata column:
seqnames ranges strand | gene_id
|
AT1G01010 1 3631-5899 + | AT1G01010
AT1G01020 1 5928-8737 - | AT1G01020
AT1G01030 1 11649-13714 - | AT1G01030
AT1G01040 1 23146-31227 + | AT1G01040
AT1G01046 1 28500-28706 + | AT1G01046
... ... ... ... . ...
ATMG01370 Mt 360717-361052 - | ATMG01370
ATMG01380 Mt 361062-361179 - | ATMG01380
ATMG01390 Mt 361350-363284 - | ATMG01390
ATMG01400 Mt 363725-364042 + | ATMG01400
ATMG01410 Mt 366086-366700 - | ATMG01410
-------
seqinfo: 7 sequences (1 circular) from an unspecified genome
# 注意这里有个no seqlengths
> gene_gtf <- genes(Txdb_gtf)
> gene_gtf
GRanges object with 37330 ranges and 1 metadata column:
seqnames ranges strand | gene_id
|
AT1G01010 Chr1 3631-5899 + | AT1G01010
AT1G01020 Chr1 6788-9130 - | AT1G01020
AT1G01030 Chr1 11649-13714 - | AT1G01030
AT1G01040 Chr1 23121-31227 + | AT1G01040
AT1G01050 Chr1 31170-33171 - | AT1G01050
... ... ... ... . ...
ATMG09730 ChrM 181268-181347 + | ATMG09730
ATMG09740 ChrM 191954-192025 - | ATMG09740
ATMG09950 ChrM 333651-333725 - | ATMG09950
ATMG09960 ChrM 347266-347348 + | ATMG09960
ATMG09980 ChrM 359666-359739 - | ATMG09980
-------
seqinfo: 7 sequences (2 circular) from an unspecified genome; no seqlengths
#
#
之后依照基因区域来提取启动子序列 #
#> promoter_package <- promoters(gene_package,upstream = 4000,downstream = 500) Warning message: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) : GRanges object contains 11 out-of-bound ranges located on sequences 1, 2, 3, 4, and Pt. Note that ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not considered out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags of the underlying sequences). You can use trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more information. > promoter_gtf <- promoters(gene_gtf,upstream = 4000,downstream = 500)
#
这儿就出现了一个坑了,就是你在用Txdb包的时侯,出现了启动序列,但用自建gtf建立的Txdb没有出现了。实际上,这些是由于你的区间取过界了。 #
> promoter_package GRanges object with 33602 ranges and 1 metadata column: seqnames ranges strand | gene_id
| AT1G01010 1 -369-4130 + | AT1G01010 AT1G01020 1 8238-12737 - | AT1G01020 AT1G01030 1 13215-17714 - | AT1G01030 AT1G01040 1 19146-23645 + | AT1G01040 AT1G01046 1 24500-28999 + | AT1G01046 ... ... ... ... . ... ATMG01370 Mt 360553-365052 - | ATMG01370 ATMG01380 Mt 360680-365179 - | ATMG01380 ATMG01390 Mt 362785-367284 - | ATMG01390 ATMG01400 Mt 359725-364224 + | ATMG01400 ATMG01410 Mt 366201-370700 - | ATMG01410 ------- seqinfo: 7 sequences (1 circular) from an unspecified genome #
#
而没有报错是由于他不晓得你的染色体的厚度范围在那儿。 #
所以下边我的操作都是基于启动序列,这样还可以应用在非模式物种上面。首先你在建立Txdb的时侯就须要给他传进去染色体厚度了。但你如何晓得每条染色体是多长呢,这就须要你的fai文件了。fai文件是给基因组fa做索引的时侯出现的,没有的话你可以直接用faidx做下。
#
#$ samtools faidx Athaliana.fa # 这里的第2列是染色体长度 Chr1 30427671 6 79 80 Chr2 19698289 30812844 79 80 Chr3 23459830 50760485 79 80 Chr4 18585056 74517281 79 80 Chr5 26975502 93337597 79 80 ChrM 366924 120654568 79 80 ChrC 154478 121026144 79 80
#
之后我们就可以在建立的时侯自动输进对应的厚度了,但这只是拟南芥这个组装完整的基因组,若果是一些非模式物种,级别的话,都会有数百条染色体厚度了,根本不可能自动输。所以我们还须要读进R上面进行一些操作。
#
# 注意不要变成因子
> seq_length <- read.table("~/reference/genome/TAIR10/Athaliana.fa.fai",
+ stringsAsFactors = F)[,1:2]
> seq_length$V1
[1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
> seq_length$V2
[1] 30427671 19698289 23459830 18585056 26975502 366924 154478
> Txdb_gtf <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf",
+ chrominfo = Seqinfo(seqnames = seq_length$V1,
+ seqlengths = seq_length$V2))
#
#
这样我们提取启动子序列的时侯才会报错了 #
> gene_gtf <- genes(Txdb_gtf)
# 此时no seqlengths消失了
> gene_gtf
GRanges object with 37330 ranges and 1 metadata column:
seqnames ranges strand | gene_id
|
AT1G01010 Chr1 3631-5899 + | AT1G01010
AT1G01020 Chr1 6788-9130 - | AT1G01020
AT1G01030 Chr1 11649-13714 - | AT1G01030
AT1G01040 Chr1 23121-31227 + | AT1G01040
AT1G01050 Chr1 31170-33171 - | AT1G01050
... ... ... ... . ...
ATMG09730 ChrM 181268-181347 + | ATMG09730
ATMG09740 ChrM 191954-192025 - | ATMG09740
ATMG09950 ChrM 333651-333725 - | ATMG09950
ATMG09960 ChrM 347266-347348 + | ATMG09960
ATMG09980 ChrM 359666-359739 - | ATMG09980
-------
seqinfo: 7 sequences from an unspecified genome
# 出现了报错
> promoter_gtf <- promoters(gene_gtf,upstream = 4000,downstream = 500)
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
GRanges object contains 15 out-of-bound ranges located on sequences Chr1, Chr2, Chr3, Chr4, Chr5, ChrC, and ChrM.
Note that ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not considered
out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags of the underlying
sequences). You can use trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more information.
# 我们用trim函数去掉越过界的区间
# 原本过界的就变成了1了
> promoter_gtf <- trim(promoter_gtf)
> promoter_gtf
GRanges object with 37330 ranges and 1 metadata column:
seqnames ranges strand | gene_id
|
AT1G01010 Chr1 1-4130 + | AT1G01010
AT1G01020 Chr1 8631-13130 - | AT1G01020
AT1G01030 Chr1 13215-17714 - | AT1G01030
AT1G01040 Chr1 19121-23620 + | AT1G01040
AT1G01050 Chr1 32672-37171 - | AT1G01050
... ... ... ... . ...
ATMG09730 ChrM 177268-181767 + | ATMG09730
ATMG09740 ChrM 191526-196025 - | ATMG09740
ATMG09950 ChrM 333226-337725 - | ATMG09950
ATMG09960 ChrM 343266-347765 + | ATMG09960
ATMG09980 ChrM 359240-363739 - | ATMG09980
-------
seqinfo: 7 sequences from an unspecified genome
#
#
之后我们只取前2个基因下来瞧瞧疗效 #
# 这边的GRange是带名字的GRange,所以我们取区间的时候可以用基因名字
> promoter_gtf_part <- promoter_gtf[c("AT1G01010","AT1G01020")]
> promoter_gtf_part
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gene_id
|
AT1G01010 Chr1 1-4130 + | AT1G01010
AT1G01020 Chr1 8631-13130 - | AT1G01020
-------
seqinfo: 7 sequences from an unspecified genome
# 用的是Biostrings包的getSeq
# 注意你这里的Grange的seqnames要和genome的seqnames要一致
# 有时候Grange是1,而Genome是Chr1
> getSeq(BSgenome.Athaliana.TAIR.TAIR9,promoter_gtf_part)
A DNAStringSet instance of length 2
width seq names
[1] 4130 CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCC...GGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGG AT1G01010
[2] 4500 ATCCGCAACAATTCACCAATTGAAGAACAAGAGAAAGGTTTAAACTT...GAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGT AT1G01020
promoter_gtf_part_seq <- getSeq(BSgenome.Athaliana.TAIR.TAIR9,promoter_gtf_part)
# 然后用writeXStringSet写成fa
writeXStringSet(promoter_gtf_part_seq,
filepath = "test.fasta",
format = "fasta")
#
#
>AT1G01010 CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG AATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTAT CGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT GGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATT TGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTG
#
#