Issues Extracting Data from XML Files Using R

Question

I have a large protein database in an XML file, and I need to extract some information from it using R. The database is organized into entries, each containing information about a specific protein that I need to extract and format.

https://www.dropbox.com/s/dq8ir9f22cnfwrz/Sample.xml

For each entry, I would like to extract the name, all the dbReferences of type "EC", and the sequence. So far I have:

library("XML")
doc <- xmlParse("Sample.xml")

I was thinking of either using the xpathSApply function to explicitly pick the tags I want, or using the xmlToDataFrame function. I'm new to R, so I'm a bit confused about where to begin.

Recommended answer

Just select the elements you need from the node set returned by getNodeSet:

# One node per <entry>; bind the document's default namespace to the prefix "ns"
nd <- getNodeSet(doc, "//ns:entry", namespaces = c(ns = getDefaultNamespace(doc)[[1]]$uri))
# local-name() matches elements by tag name regardless of their namespace
y <- data.frame(
  id = sapply(nd, xpathSApply, './*[local-name()="name"]', xmlValue),
  ec = sapply(nd, function(n) paste(xpathSApply(n, './/*[local-name()="dbReference" and @type="EC"]/@id'), collapse = "; ")),
  sequence = gsub("\n", "", sapply(nd, xpathSApply, './*[local-name()="sequence"]', xmlValue)))

head(y, 3)
           id                                                                      ec                                         sequence  
1 AK1C3_HUMAN 1.-.-.-; 1.1.1.357; 1.1.1.112; 1.1.1.188; 1.1.1.239; 1.1.1.64; 1.3.1.20  MDSKHQCVKLNDGHFMPVLGFGTYAPPEVPRSKALEVTKLAIEA...
2 CP3A4_HUMAN              1.14.13.-; 1.14.13.157; 1.14.13.32; 1.14.13.67; 1.14.13.97  MALIPDLAMETWLLLAVSLVLLYLYGTHSHGLFKKLGIPGPTPL...
3 AK1C1_HUMAN                                 1.1.1.-; 1.1.1.149; 1.1.1.112; 1.3.1.20  MDSKYQCVKLNDGHFMPVLGFGTYAPAEVPKSKALEATKLAIEA...
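
To make the namespace handling easier to follow, here is a minimal self-contained sketch of the same idea. It uses a prefixed namespace in the XPath instead of local-name(). The inline XML is a made-up two-entry stand-in for Sample.xml, and the names xml_text, extract_entry, and result are hypothetical. It assumes every entry has exactly one name and one sequence element.

```r
library(XML)

# Hypothetical two-entry sample that mimics the UniProt layout (not the real Sample.xml)
xml_text <- '<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>PROT1_HUMAN</name>
    <dbReference type="EC" id="1.1.1.1"/>
    <dbReference type="EC" id="1.1.1.2"/>
    <dbReference type="PDB" id="1ABC"/>
    <sequence>MDSK
HQCV</sequence>
  </entry>
  <entry>
    <name>PROT2_HUMAN</name>
    <dbReference type="PDB" id="2XYZ"/>
    <sequence>MALIP</sequence>
  </entry>
</uniprot>'

doc <- xmlParse(xml_text, asText = TRUE)

# Bind the default namespace URI to a prefix so XPath can address the elements
ns <- c(ns = "http://uniprot.org/uniprot")
entries <- getNodeSet(doc, "//ns:entry", namespaces = ns)

# Build a one-row data frame for a single <entry> node
extract_entry <- function(node) {
  ec <- xpathSApply(node, './/ns:dbReference[@type="EC"]/@id', namespaces = ns)
  data.frame(
    id = xpathSApply(node, "./ns:name", xmlValue, namespaces = ns),
    # unlist() turns an empty result into NULL, so paste() yields ""
    ec = paste(unlist(ec), collapse = "; "),
    # strip the line breaks that appear inside the sequence text
    sequence = gsub("\\s", "", xpathSApply(node, "./ns:sequence", xmlValue, namespaces = ns)),
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(entries, extract_entry))
print(result)
```

An entry without any EC references ends up with an empty string in the ec column rather than causing an error, which keeps the rows aligned.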

You could also drop the namespace and simplify these queries:

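x <- readLines("Sample.xml")
# Overwrite the root opening tag (line 2 of this file) with one that declares no namespace
x[2] <- "<uniprot>"
doc <- xmlParse(x)
# Plain element names now work without a prefix
nd <- getNodeSet(doc, "//entry")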
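
Overwriting line 2 only works if the root tag sits exactly on that line. A slightly more defensive variant, sketched below, removes just the default xmlns declaration with a regular expression and leaves everything else alone. The inline lines are a made-up stand-in for readLines("Sample.xml"), and the variable names are hypothetical.

```r
library(XML)

# Hypothetical stand-in for readLines("Sample.xml")
xml_lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">',
  '<entry><name>PROT1_HUMAN</name></entry>',
  '</uniprot>'
)

# Drop only the default namespace declaration; prefixed ones such as xmlns:xsi stay
stripped <- sub(' xmlns="[^"]*"', "", xml_lines)

doc <- xmlParse(paste(stripped, collapse = "\n"), asText = TRUE)

# With no default namespace, unprefixed XPath steps match directly
entry_names <- xpathSApply(doc, "//entry/name", xmlValue)
print(entry_names)
```

This is still text surgery on the file rather than real namespace handling, so it is best kept for quick exploration.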

Or use the REST services from UniProt instead.
