XML to dataframe: How to get a default value if a node does not exist

Question

In R, I want to parse an XML file using the XML package. The actual file comes from Eurostat's REST service; a link to the actual data is at the end of the question. The relevant structure of the file is as follows:

library(XML)
doc <- xmlParse( # needed to run example
'<?xml version="1.0" ?>
<Series>
  <Obs>
    <ObsDimension value="2009"/>
    <ObsValue value="NaN"/>
    <Attributes>
      <Value id="OBS_STATUS" value="na"/>
    </Attributes>
  </Obs>
  <Obs>
    <ObsDimension value="2006"/>
    <ObsValue value="NaN"/>
    <Attributes>
      <Value id="OBS_STATUS" value="na"/>
    </Attributes>
  </Obs>
  <Obs>
    <ObsDimension value="2009"/>
    <ObsValue value="43.75"/>
  </Obs>
  <Obs>
    <ObsDimension value="2006"/>
    <ObsValue value="NaN"/>
    <Attributes>
      <Value id="OBS_STATUS" value="na"/>
      <Value id="OBS_FLAG" value="e"/>
    </Attributes>
  </Obs>
</Series>
') # needed to run example

So each Obs node has a dimension and a value. In addition, there are two optional attributes, identified by an id attribute of OBS_STATUS or OBS_FLAG. I want to parse this structure into a data frame so that NA is used when an attribute is not present. The result should look like this:

  dimension value status flag
1      2009   NaN     na <NA>
2      2006   NaN     na <NA>
3      2009 43.75   <NA> <NA>
4      2006   NaN     na    e

I prepared the following code, which obviously fails because the columns are not of equal length.

library(XML)
data.frame(dimension = xpathSApply(doc,"//ObsDimension",xmlGetAttr,"value"),
           value = xpathSApply(doc,"//ObsValue",xmlGetAttr,"value"),
           status = xpathSApply(doc,
                                "//Attributes/Value[@id='OBS_STATUS']",
                                xmlGetAttr,"value"),
           flag = xpathSApply(doc,
                                "//Attributes/Value[@id='OBS_FLAG']",
                                xmlGetAttr,"value"))
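
To make the failure concrete, the following sketch counts how many nodes each XPath expression matches. It uses a reduced copy of the sample (two Obs nodes instead of four), and the names `paths` and `counts` are introduced here for illustration only. With the full sample above, the four expressions match 4, 4, 3 and 1 nodes respectively, which is why data.frame() cannot line the columns up.

```r
library(XML)

# Reduced copy of the sample: two Obs nodes, only the first has Attributes
doc <- xmlParse(
    '<Series>
       <Obs>
         <ObsDimension value="2009"/>
         <ObsValue value="NaN"/>
         <Attributes><Value id="OBS_STATUS" value="na"/></Attributes>
       </Obs>
       <Obs>
         <ObsDimension value="2009"/>
         <ObsValue value="43.75"/>
       </Obs>
     </Series>',
    asText = TRUE
)

# The same four XPath expressions as in the failing data.frame() call
paths <- c(
    dimension = "//ObsDimension",
    value     = "//ObsValue",
    status    = "//Attributes/Value[@id='OBS_STATUS']",
    flag      = "//Attributes/Value[@id='OBS_FLAG']"
)

# Number of nodes each expression matches in the whole document
counts <- sapply(paths, function(p) length(getNodeSet(doc, p)))
print(counts)
```

Because each expression is evaluated against the whole document, an Obs without a matching node simply contributes nothing, and the information about which row the value belonged to is lost.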

Is there a good way of defining a default value when a specified node is not present? Any help would be greatly appreciated.

Addendum, added after receiving the answer by @MrFlick: the data I actually need to parse can be loaded with the following code:

library(XML)
library(RCurl)
file <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/cdh_e_fos/..PC.FOS1.BE/?startperiod=2005&endPeriod=2013"
content <- getURL(file, httpheader = list('User-Agent' = 'R-Agent'))
root <- xmlRoot(xmlInternalTreeParse(content, useInternalNodes = TRUE))

Recommended answer

Take 1

Here is one possible strategy. There is a nice xmlToDataFrame function, but your data isn't quite in the right format for it. I think it would be easiest to transform your data into a more suitable format and then use that function. Here's one such transformation:

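# Build a new document whose root <data> holds one <row> per Obs
trn <- newXMLDoc()
addChildren(trn, newXMLNode("data"))

for (x in getNodeSet(doc, "//Obs")) {
    row <- newXMLNode("row")
    # Visit every leaf element (one with no child elements) below this Obs
    for (z in getNodeSet(x, ".//*[not(*)]")) {
        # Name the new node after the id attribute, falling back to the tag name
        li <- newXMLNode(xmlGetAttr(z, "id", xmlName(z)))
        addChildren(li, newXMLTextNode(xmlGetAttr(z, "value", NA)))
        addChildren(row, li)
    }
    addChildren(xmlRoot(trn), row)
}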

This creates a new XML document that ends up looking like this:

<?xml version="1.0"?>
<data>
  <row>
    <ObsDimension>2009</ObsDimension>
    <ObsValue>NaN</ObsValue>
    <OBS_STATUS>na</OBS_STATUS>
  </row>
  <row>
    <ObsDimension>2006</ObsDimension>
    <ObsValue>NaN</ObsValue>
    <OBS_STATUS>na</OBS_STATUS>
  </row>
  <row>
    <ObsDimension>2009</ObsDimension>
    <ObsValue>43.75</ObsValue>
  </row>
  <row>
    <ObsDimension>2006</ObsDimension>
    <ObsValue>NaN</ObsValue>
    <OBS_STATUS>na</OBS_STATUS>
    <OBS_FLAG>e</OBS_FLAG>
  </row>
</data>

We can then call

xmlToDataFrame(trn)

to get

  ObsDimension ObsValue OBS_STATUS OBS_FLAG
1         2009      NaN         na     <NA>
2         2006      NaN         na     <NA>
3         2009    43.75       <NA>     <NA>
4         2006      NaN         na        e
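
If numeric columns are needed afterwards, the result can be converted in a separate step. The sketch below is an addition, not part of the original answer: it assumes the columns come back as character (or factor) and, to stay self-contained, rebuilds the printed table by hand under the hypothetical name `res` instead of calling xmlToDataFrame.

```r
# Hand-built stand-in for the xmlToDataFrame(trn) result shown above
res <- data.frame(
    ObsDimension = c("2009", "2006", "2009", "2006"),
    ObsValue     = c("NaN", "NaN", "43.75", "NaN"),
    OBS_STATUS   = c("na", "na", NA, "na"),
    OBS_FLAG     = c(NA, NA, NA, "e"),
    stringsAsFactors = FALSE
)

# as.character() first, so the conversion is also safe for factor columns
res$ObsDimension <- as.integer(as.character(res$ObsDimension))
res$ObsValue     <- as.numeric(as.character(res$ObsValue))

str(res)
```

The string "NaN" converts to R's numeric NaN, so the missing observations stay distinguishable from the NA entries in the status and flag columns.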

Yes, I use some ugly for loops, but that's really to make sure we create a value for each Obs node. Obs is the primary unit of data, so you can't skip over it when grabbing nodes with XPath. You could build the data.frame directly in the loop, but I prefer to let xmlToDataFrame take care of the fact that each node has a potentially different number of elements.

If you really need to specify a default value when a node doesn't exist, you can create a function similar to xmlGetAttr that also checks whether the node exists. Here is such a helper function.

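xmlGetNodeAttr <- function(n, xp, attr, default=NA) {
    # Nodes matching the XPath expression, relative to node n
    ns <- getNodeSet(n, xp)
    if (length(ns) < 1) {
        # No such node: fall back to the default
        return(default)
    } else {
        # Node(s) found: read the attribute, using the default if it is absent
        sapply(ns, xmlGetAttr, attr, default)
    }
}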
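
As a quick sanity check, the helper can be tried on a single Obs that has no Attributes child. This is only a sketch: the helper is repeated so the block runs on its own, and the names `one_obs`, `present` and `absent` are made up for the illustration.

```r
library(XML)

# Helper from the answer, repeated here so the example is self-contained
xmlGetNodeAttr <- function(n, xp, attr, default = NA) {
    ns <- getNodeSet(n, xp)
    if (length(ns) < 1) {
        return(default)
    }
    sapply(ns, xmlGetAttr, attr, default)
}

# A single Obs without an Attributes element
one_obs <- xmlRoot(xmlParse(
    '<Obs><ObsDimension value="2009"/><ObsValue value="43.75"/></Obs>',
    asText = TRUE
))

# The node exists, so its value attribute is returned
present <- xmlGetNodeAttr(one_obs, "./ObsValue", "value")

# The node does not exist, so the default (NA) is returned
absent <- xmlGetNodeAttr(one_obs, "./Attributes/Value[@id='OBS_STATUS']", "value")

print(present)
print(absent)
```

Since the helper always returns at least one value, every call yields something usable as a data.frame cell, which is what keeps the columns the same length.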

We could apply it to your data with

do.call(rbind, lapply(xmlChildren(xmlRoot(doc)), function(x) {
    data.frame(
        dimension=xmlGetNodeAttr(x, "./ObsDimension","value",NA),
        value=xmlGetNodeAttr(x, "./ObsValue","value",NA),
        status=xmlGetNodeAttr(x, "./Attributes/Value[@id='OBS_STATUS']","value",NA),
        flag=xmlGetNodeAttr(x, "./Attributes/Value[@id='OBS_FLAG']","value",NA)
    )
}))

which produces the same result. Here we still must loop over the Obs nodes individually, because there is no way to force XPath to return a match for every Obs.
