查询区域内的基因 [英] Query genes within regions

查看：90 发布时间：2021/4/15 19:47:12 r bioinformatics bioconductor biomart

本文介绍了查询区域内的基因的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想检索一系列区域中存在的基因.说，我有一个带有查询位置的床文件，例如:

  1 2665697 4665777 MIR2011 10391435 12391516 MIR5001 15106831 17106911 MIR1221 23436535 25436616 MIR2341 23436575 25436656 MIR488

我想得到属于那些区域的基因.

我尝试使用 biomaRt 和 bedtools 相交，但是我得到的输出是对应于所有区域的基因列表，而不是一一对应我想要获得的期望输出将是每一行中的基因，但是在单独的行中，如果我一次执行一个查询区域.基本上，我想知道哪些基因属于每个区域，但仍然能够识别哪些基因属于哪些区域.

我正在做的是，从检测到的miRNA区域向上和向下扩展基因组区域，以便从该miRNA中获得邻近的基因.我正在上下使用一百万个基地的窗户.这仅适用于一个查询，但是，如何使用biomaRt进行许多查询或使用bedtools进行许多交集，所以我得到的像是:

  1 2665697 4665777 MIR201 GENEX，GENEY，GENEZ ...1 10391435 12391516 MIR500 GENEA，GENEB，GENEC ...1 15106831 17106911 MIR1221 23436535 25436616 MIR2341 23436575 25436656 MIR488

意思是GENEX，GENNEY和GENEZ落在1:2665697-4665777之内，而MIR201位于中间，因为计算该区域可减去sart的1百万bp，并为末端位置增加100万bp.

我正在某种程度上确定每个miRNA的邻近基因，以便在物种内进行比较，但是我不知道如何使用 biomaRt 或 bedtools 分别查询多个区域./p>

有帮助吗?

解决方案

与 @Jimbou 相同，但没有tidyverse :

 库(biomaRt)# 数据d<-read.table(text ="1 2665697 4665777 MIR2011 10391435 12391516 MIR5001 15106831 17106911 MIR1221 23436535 25436616 MIR2341 23436575 25436656 MIR488)#指定数据库ensembl = useMart("ensembl"，数据集="hsapiens_gene_ensembl")#遍历行，获取基因，然后粘贴以折叠，#最后与数据d绑定在一起.res<-cbind(d，基因= apply(d，1，function(i){x<-getBM(attributes = c("external_gene_name")，过滤器= c("chromosome_name"，"start"，"end")，值= list(i [1]，i [2]，i [3])，mart =合奏)#仅保留3个基因，因为输出太长.#在您的情况下，删除以下行x<-头(x，3)#返回基因，逗号分隔粘贴(x $ external_gene_name，折叠=，")}))资源#V1 V2 V3 V4基因#1 1 2665697 4665777 MIR201 TTC34，AC242022.1，AL592464.2#2 1 10391435 12391516 MIR500 AL139424.2，PGD，AL139424.1#3 1 15106831 17106911 MIR122 KAZN，TMEM51-AS1，TMEM51#4 1 23436535 25436616 MIR234 ASAP3，E2F2，AL021154.1#5 1 23436575 25436656 MIR488 ASAP3，E2F2，AL021154.1

I want to retrieve the genes that are present within a series of regions. Say, I have a bed file with query positions such like:

1     2665697     4665777      MIR201
1    10391435    12391516      MIR500
1    15106831    17106911      MIR122
1    23436535    25436616      MIR234 
1    23436575    25436656      MIR488

I would like to get the genes that fall within those regions.

I have tried using biomaRt, and bedtools intersect, but the output I get, is a list of genes corresponding to all the regions, not one by one, as the desired output I would like to get would be the genes within each row, but in separate rows, a if I did one query region at a time. Basically I want to know what genes fall within each region, but still being able to identify which genes fall in which regions.

What I am doing is, from a region of detected miRNA, I am expanding the genome region upwards and downwards, so that I get the neighboring genes from this miRNA. I am using a 1 million bases windows up and down. This would work for just one query, but, how to do many queries with biomaRt or many intersections with bedtools, so that I get somewhat like:

1     2665697     4665777      MIR201      GENEX, GENEY, GENEZ...
1    10391435    12391516      MIR500      GENEA, GENEB, GENEC...
1    15106831    17106911      MIR122
1    23436535    25436616      MIR234 
1    23436575    25436656      MIR488

Meaning that GENEX, GENEY and GENEZ fall within 1:2665697-4665777, with MIR201, placed in the middle, as this region is calculated subtracting 1 million bp to sart, and adding 1 million bp to end position.

I am somewhat determining the neighboring genes from each miRNA, to compare within species, but I do not get how to query multiple regions individually using biomaRt or bedtools.

Any help?

解决方案

Same approach as @Jimbou without tidyverse:

library(biomaRt)

# data
d <- read.table(text = "1     2665697     4665777      MIR201
1    10391435    12391516      MIR500
1    15106831    17106911      MIR122
1    23436535    25436616      MIR234 
1    23436575    25436656      MIR488")

# specify the database
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# loop through rows, get genes, then paste with collapse,
# and finally bind back with data d.
res <- cbind(
  d,
  genes = apply(d, 1, function(i){
    x <- getBM(attributes=c("external_gene_name"), 
               filters = c("chromosome_name" , "start", "end"), 
               values = list(i[1], i[2], i[3]), 
               mart = ensembl)
    # keeping only 3 genes, as output is too long.
    # In your case remove below line
    x <- head(x, 3)

    # return genes, comma separated
    paste(x$external_gene_name, collapse = ",")
  })
)

res
#   V1       V2       V3     V4                       genes
# 1  1  2665697  4665777 MIR201 TTC34,AC242022.1,AL592464.2
# 2  1 10391435 12391516 MIR500   AL139424.2,PGD,AL139424.1
# 3  1 15106831 17106911 MIR122      KAZN,TMEM51-AS1,TMEM51
# 4  1 23436535 25436616 MIR234       ASAP3,E2F2,AL021154.1
# 5  1 23436575 25436656 MIR488       ASAP3,E2F2,AL021154.1

这篇关于查询区域内的基因的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

查询区域内的基因 [英] Query genes within regions

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

查询区域内的基因 [英] Query genes within regions

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭