Gene ontology (GO) analysis for a list of genes (with ENTREZID) in R?

Problem description

I am very new to GO analysis and I am a bit confused about how to run it on my list of genes.

I have a list of genes (n=10):

gene_list

    SYMBOL ENTREZID                              GENENAME
1    AFAP1    60312   actin filament associated protein 1
2  ANAPC11    51529 anaphase promoting complex subunit 11
3   ANAPC5    51433  anaphase promoting complex subunit 5
4     ATL2    64225                     atlastin GTPase 2
5    AURKA     6790                       aurora kinase A
6    CCNB2     9133                             cyclin B2
7    CCND2      894                             cyclin D2
8    CDCA2   157313      cell division cycle associated 2
9    CDCA7    83879      cell division cycle associated 7
10  CDCA7L    55536 cell division cycle associated 7-like

I simply want to find their functions, and it was suggested that I use GO analysis tools. I am not sure whether this is the correct way to do it. Here is my attempt:

library(org.Hs.eg.db)

x <- org.Hs.egGO

# Keep only the GO annotations for the Entrez gene identifiers in gene_list
xx <- as.list(x[as.character(gene_list$ENTREZID)])

So I now have a list, keyed by Entrez ID, in which each gene is assigned several GO terms. For example:

> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"

$`GO:0009966`$Evidence
[1] "IEA"

$`GO:0009966`$Ontology
[1] "BP"


$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"

$`GO:0051493`$Evidence
[1] "IEA"

$`GO:0051493`$Ontology
[1] "BP"

My question is: how can I find the function of each of these genes in a simpler way, and am I doing this correctly? I want to add the function to gene_list as a function/GO column.

Thanks in advance,

Solution

EDIT: There is a new Bioinformatics SE (currently in beta mode).


I hope I understand what you are aiming for here.

BTW, for bioinformatics-related topics, you can also have a look at biostar, which serves the same purpose as SO but for bioinformatics.

If you just want a list of the functions associated with each gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an API for querying BioMart databases. You will need an internet connection to run the query, though.

Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes that walk you through the different steps of the analysis (and even highlight how you should design your data and what some of the pitfalls are).

In your case, you can work directly from the biomaRt vignette, task 2 in particular:

Note: there are slightly quicker ways than the one I report below:

# load the library
library("biomaRt")

# I prefer Ensembl, so that is the one I will query, but you can
# query other marts; try out: listMarts()
ensembl <- useMart("ensembl")

# as it seems that you are looking for human genes:
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# if you want other model organisms, have a look at:
# listDatasets(ensembl)

You need to build your query (your list of Entrez IDs). To see which filters you can query on:

filters = listFilters(ensembl)

Then you want to retrieve attributes: the GO ID and its description. To see the list of available attributes:

attributes = listAttributes(ensembl)
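
Both tables are long, so it can help to search them by name instead of scrolling. A minimal sketch, assuming an internet connection and that the entries of interest contain "entrez" or "go" in their names (the exact names vary between Ensembl releases, so treat the patterns as a starting point):

```r
library(biomaRt)

# Connect to the human gene dataset (needs network access)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

filters    <- listFilters(ensembl)
attributes <- listAttributes(ensembl)

# Filters whose name mentions Entrez (candidates for the `filters` argument)
entrez_filters <- filters[grep("entrez", filters$name, ignore.case = TRUE), ]

# Attributes related to GO (candidates for the `attributes` argument)
go_attrs <- attributes[grep("^go_|name_1006", attributes$name),
                       c("name", "description")]

print(entrez_filters)
print(go_attrs)
```

Whatever names show up here are the ones to pass to getBM().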

For you, the query would look something like:

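goids <- getBM(
        # you want entrezgene so you know which row belongs to which gene;
        # go_id is the GO ID and name_1006 is the identifier of the 'GO term name'
        # (newer Ensembl releases call the Entrez attribute/filter 'entrezgene_id';
        # check listAttributes() and listFilters() if 'entrezgene' is rejected)
        attributes = c('entrezgene', 'go_id', 'name_1006'),
        filters = 'entrezgene',
        values = gene_list$ENTREZID,
        mart = ensembl)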

The query itself can take a while.

Then you can always collapse the information into two columns (but I would not recommend it for anything other than reporting purposes).

# build one row per gene, pasting all of its GO IDs and term names together
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
    tempo <- goids[goids$entrezgene == x, ]
    data.frame('ENTREZGENE' = x,
               'Go.ID' = paste(tempo$go_id, collapse = ' ; '),
               'GO.term' = paste(tempo$name_1006, collapse = ' ; '))
}))
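
Since the goal in the question is a function/GO column on gene_list, the collapsed table still has to be joined back onto it. Here is one possible sketch using only base R. The small `goids` and `gene_list` data frames are toy stand-ins so the example runs offline: the term names and the "GO:XXXXXXX" ID are placeholders, not real annotations, and `collapse_terms` is a helper name made up for illustration.

```r
# Toy stand-in for the data frame returned by getBM(); the term names and
# the "GO:XXXXXXX" ID are placeholders, not real annotations.
goids <- data.frame(
  entrezgene = c(60312, 60312, 6790),
  go_id      = c("GO:0009966", "GO:0051493", "GO:XXXXXXX"),
  name_1006  = c("term A", "term B", "term C"),
  stringsAsFactors = FALSE
)

# Toy subset of gene_list; 9133 deliberately has no GO rows above.
gene_list <- data.frame(
  SYMBOL   = c("AFAP1", "AURKA", "CCNB2"),
  ENTREZID = c(60312, 6790, 9133),
  stringsAsFactors = FALSE
)

# Join the distinct, non-empty values of one gene into a single string
collapse_terms <- function(v) paste(unique(v[v != ""]), collapse = " ; ")

# One row per gene, with all GO IDs and term names collapsed
go_collapsed <- aggregate(
  goids[c("go_id", "name_1006")],
  by  = list(ENTREZID = goids$entrezgene),
  FUN = collapse_terms
)

# Left join so that genes without any GO annotation are kept (as NA)
gene_list_go <- merge(gene_list, go_collapsed, by = "ENTREZID", all.x = TRUE)
print(gene_list_go)
```

The left join (`all.x = TRUE`) is a deliberate choice: a gene with no GO rows stays in the table with NA rather than silently disappearing.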


Edit:

If you want to query a past version of the ensembl database:

ens82<-useMart(host='sep2015.archive.ensembl.org',
               biomart='ENSEMBL_MART_ENSEMBL',
               dataset='hsapiens_gene_ensembl')

and then the query would be:

goids = getBM(attributes=c('entrezgene','go_id', 'name_1006'),  
        filters='entrezgene',values=gene_list$ENTREZID, 
        mart=ens82)
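
If you would rather not depend on a remote query at all, the annotation packages from the question can give a similar table locally. This is only a sketch of an alternative, not the biomaRt approach above: it assumes the Bioconductor packages org.Hs.eg.db and GO.db are installed, and it uses three IDs from the question's gene_list as example keys. Note that the GO annotations bundled in these packages may differ from what a given Ensembl release returns.

```r
library(org.Hs.eg.db)  # human gene annotation (Bioconductor)
library(GO.db)         # GO term names and definitions (Bioconductor)

# A few Entrez IDs from gene_list; keys must be character strings
entrez_ids <- as.character(c(60312, 51529, 6790))

# Map Entrez IDs to GO IDs (evidence code and ontology come along with "GO")
gene2go <- AnnotationDbi::select(org.Hs.eg.db,
                                 keys    = entrez_ids,
                                 keytype = "ENTREZID",
                                 columns = c("SYMBOL", "GO", "ONTOLOGY"))

# Look up a human-readable term name for each GO ID
go_terms <- AnnotationDbi::select(GO.db,
                                  keys    = unique(na.omit(gene2go$GO)),
                                  keytype = "GOID",
                                  columns = "TERM")

# One row per gene/GO pair, now with the term name attached
annotated <- merge(gene2go, go_terms, by.x = "GO", by.y = "GOID", all.x = TRUE)
head(annotated)
```

This replaces the nested list from `as.list(org.Hs.egGO[...])` with a flat data frame, which is easier to filter (for example by ONTOLOGY == "BP") or collapse per gene.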


However, if you had a GO enrichment analysis in mind, your list of genes is too short.
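
To give a feel for what an enrichment test computes, here is a rough sketch of one common formulation: a hypergeometric test for a single GO term, written in base R. All counts are hypothetical and chosen only to show the arithmetic. They are not taken from any database, and dedicated enrichment packages handle the background and multiple-testing correction more carefully.

```r
# All counts below are hypothetical, chosen only to illustrate the arithmetic.
N <- 20000  # genes in the background (universe)
K <- 400    # background genes annotated to the GO term of interest
n <- 10     # genes in the list being tested
k <- 2      # genes in the list annotated to that term

# P(X >= k) when drawing n genes without replacement from the background
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Same test expressed as a one-sided Fisher's exact test on the 2x2 table
# (rows: annotated / not annotated; columns: in list / not in list)
tab <- matrix(c(k, n - k, K - k, N - K - (n - k)), nrow = 2, byrow = TRUE)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

With only ten genes, the count k can take very few values for any term, and the test would have to be repeated over thousands of terms, which is one way to see why such a short list leaves little room for a convincing result.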
