R-XML pulling nodes into matrix/DF accounting for missing nodes


Question

I am fairly new to R and very new to the XML package and XPath. I need to pull four elements from an XML file that looks like this (I have trimmed off a lot of other nodes to simplify it here):

<?xml version="1.0" encoding="utf-8"?>
<iati-activities version="1.03" generated-datetime="2015-07-07T16:49:09+00:00">
  <iati-activity last-updated-datetime="2014-08-11T14:36:59+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100530</iati-identifier>
<title>Improvement of basic health care</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<participating-org role="Funding" ref="EU" type="15">EU</participating-org>
<participating-org role="Funding" type="21">Cordaid Memisa</participating-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="21">CORDAID RCA</participating-org>
<recipient-country percentage="100" code="CF">CENTRAL AFRICAN REPUBLIC</recipient-country>
<budget type="1">
  <period-start iso-date="2010-01-01"></period-start>
  <period-end iso-date="2013-02-28"></period-end>
</budget>
  </iati-activity>
  <iati-activity last-updated-datetime="2013-07-19T14:12:14+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100625</iati-identifier>
<title>Pigs for Pencils</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Funding" type="60">Stichting Kapatiran</participating-org>
<participating-org role="Implementing" type="22">PREDA Foundation Inc.</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
  <period-start iso-date="2010-04-20"></period-start>
  <period-end iso-date="2012-10-02"></period-end>
  <value value-date="2010-04-20">12500</value>
</budget>
   </iati-activity>
  <iati-activity last-updated-datetime="2015-04-08T03:01:58+00:00" xml:lang="en" default-currency="EUR">
    <iati-identifier>NL-KVK-41160054-100815</iati-identifier>
<title>Job and housing opportunities for women </title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="22">WISE</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
  <period-start iso-date="2010-10-01"></period-start>
  <period-end iso-date="2011-12-31"></period-end>
  <value value-date="2010-10-01">227000</value>
</budget>
  </iati-activity>
</iati-activities>

This is also my first question on StackOverflow, so apologies if I'm not doing it correctly (and because the XML is not perfectly aligned). The elements I need, and the names I'm assigning them to, are:

UniqueID <- "//iati-activity/iati-identifier"

GrantTitle <- "//iati-activity/title"

GrantAmount <- "//iati-activity/budget/value"

Recipient <- "//iati-activity/participating-org[@role='Implementing']"

So far (after much trial and tribulation) I have come up with this code. The function takes the current node (x), pulls the four variables, and cbinds them into a row. xpathApply then loops over the iati-activity nodes, calling the function on each one, and the resulting rows are rbound together.

This code works when all four elements exist in each activity. However, note that the first activity in the XML sample has no budget/value node. I removed it deliberately to reproduce the problem of missing nodes, which occur frequently in the full file for almost all the elements I need.

Also note the [1] at the end of my XPath expressions. I've included these because the full file also contains multiple titles, multiple participating-orgs of all types, and so on.
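As a rough illustration of what the [1] predicate does, here is a minimal sketch on a made-up activity (the XML snippet and the variable names are assumptions for illustration, not taken from the real file). One thing worth noticing is that `./budget/value[1]` picks the first value inside each budget, while `(./budget/value)[1]` picks only the first match overall.

```r
library(XML)

# Hypothetical activity with two titles and two budgets, for illustration only
txt <- '<iati-activities>
  <iati-activity>
    <title>First title</title>
    <title>Second title</title>
    <budget><value>100</value><value>200</value></budget>
    <budget><value>300</value></budget>
  </iati-activity>
</iati-activities>'

doc <- xmlParse(txt, asText = TRUE)
activity <- getNodeSet(doc, "//iati-activity")[[1]]

# [1] is applied per parent: the first <title> child of this activity
first_title <- xpathSApply(activity, "./title[1]", xmlValue)

# First <value> within each <budget>, so this returns two values here
per_budget <- xpathSApply(activity, "./budget/value[1]", xmlValue)

# Parentheses make [1] apply to the whole node set, so only one value is returned
overall <- xpathSApply(activity, "(./budget/value)[1]", xmlValue)
```

If an activity can hold several budgets, the parenthesized form is probably the safer way to get exactly one value per activity.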

Given that some elements occur multiple times and others not at all, it is impossible to simply pull all the same elements into a vector and pop it into a data frame. Hence the need to loop through each activity, pulling its elements one activity at a time. My code currently fails to account for missing elements (the missing budget/value in the first iati-activity) because cbind (and rbind) ignore null vectors.
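The dropping behaviour can be reproduced in base R without any XML at all. This is only a sketch: the values are stand-ins for what the four queries would return, with `character(0)` assumed as the result of a query that matched nothing, and `row1`, `row2`, `amount` and `fixed` are names made up for the example.

```r
# Stand-ins for the query results; GrantAmount matched nothing
UniqueID    <- "NL-KVK-41160054-100530"
GrantTitle  <- "Improvement of basic health care"
GrantAmount <- character(0)
Recipient   <- "CORDAID RCA"

# cbind ignores the zero-length vector, so this row has only three columns
row1 <- cbind(UniqueID = UniqueID, GrantTitle = GrantTitle,
              GrantAmount = GrantAmount, Recipient = Recipient)

# A complete row with all four columns
row2 <- cbind(UniqueID = "NL-KVK-41160054-100625", GrantTitle = "Pigs for Pencils",
              GrantAmount = "12500", Recipient = "PREDA Foundation Inc.")

# Rows of different widths cannot be stacked; capture the error message
stacked <- tryCatch(rbind(row1, row2), error = function(e) conditionMessage(e))

# Substituting NA for the empty result keeps the column in place
amount <- if (length(GrantAmount) == 0) NA_character_ else GrantAmount
fixed <- rbind(cbind(UniqueID = UniqueID, GrantTitle = GrantTitle,
                     GrantAmount = amount, Recipient = Recipient),
               row2)
```

So the fix presumably comes down to turning every empty query result into NA before binding.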

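library(XML)

# Parse the file into an internal document so XPath queries can be run on it
xmltestNA <- xmlInternalTreeParse("XMLtoDF_TestNA.xml", useInternalNodes = TRUE)

# Build a one-row matrix from a single iati-activity node
bodyToDF <- function(x) {
  UniqueID    <- xpathSApply(x, "./iati-identifier", xmlValue)
  GrantTitle  <- xpathSApply(x, "./title[1]", xmlValue)
  GrantAmount <- xpathSApply(x, "./budget/value[1]", xmlValue)
  Recipient   <- xpathSApply(x, "./participating-org[@role='Implementing'][1]", xmlValue)
  # A query with no match returns an empty result, which cbind drops
  cbind(UniqueID = UniqueID, GrantTitle = GrantTitle, GrantAmount = GrantAmount, Recipient = Recipient)
}

# Apply the function to every activity and stack the resulting rows
res <- xpathApply(xmltestNA, "//iati-activity", fun = bodyToDF)
IatiNA <- do.call(rbind, res)
IatiNA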

How can I keep the null values/missing nodes so that the result can be turned into a matrix or data frame that looks like this:

  UniqueID               GrantTitle                              GrantAmount Recipient
1 NL-KVK-41160054-100530 Improvement of basic health care        NA          CORDAID RCA
2 NL-KVK-41160054-100625 Pigs for Pencils                        12500       PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women 227000      WISE

Because I'm still new, the simpler the code, the better. Thanks in advance!

Recommended answer

If your XPath queries return too many or too few results, I think it's easier to work with the nodes.

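library(XML)

doc <- xmlParse('<your xml here>')
nodes <- getNodeSet(doc, "//iati-activity")

# Compare: querying the whole document returns one flat vector,
# so you cannot tell which activity each value belongs to
xpathSApply(doc, "//budget/value", xmlValue)
xpathSApply(doc, "//participating-org[@role='Funding']", xmlValue)

# Querying each node separately gives one result per activity
# (empty when nothing matches, several values when there are multiple matches)
sapply(nodes, function(x) xpathSApply(x, "./budget/value", xmlValue))
sapply(nodes, function(x) xpathSApply(x, "./participating-org[@role='Funding']", xmlValue))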

Add a function to handle missing or multiple nodes, and then create the data.frame:

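# Return NA when the path matches nothing, and collapse multiple matches
# into one comma-separated string, so every node yields exactly one value
xpath2 <-function(x, path, fun = xmlValue, ...){
   y <- xpathSApply(x, path, fun, ...)
   ifelse(length(y) == 0, NA,
    ifelse(length(y) > 1, paste(unlist(y), collapse=", "), y))
}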

GrantAmount <- sapply(nodes, xpath2, "./budget/value")
UniqueID    <- sapply(nodes, xpath2, "./iati-identifier")
GrantTitle  <- sapply(nodes, xpath2, "./title")
Recipient   <-  sapply(nodes, xpath2, "./participating-org[@role='Implementing']")
## updated xpath2 so xmlGetAttr will also work
Funding_ref  <- sapply(nodes, xpath2, "./participating-org[@role='Funding']", xmlGetAttr, "ref")
Budget_start <- sapply(nodes, xpath2, ".//period-start", xmlGetAttr, "iso-date")

data.frame(UniqueID, GrantTitle, GrantAmount, Recipient)
                UniqueID                               GrantTitle GrantAmount             Recipient
1 NL-KVK-41160054-100530         Improvement of basic health care        <NA>           CORDAID RCA
2 NL-KVK-41160054-100625                         Pigs for Pencils       12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women       227000                  WISE
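The `<NA>` in the output above suggests that GrantAmount comes back as text rather than as a number. Below is a self-contained sketch of one way to get a numeric column, run on a trimmed-down stand-in for the sample XML. The helper `xpath_value` is a hypothetical variant of xpath2 written with plain if/else instead of ifelse, and the `Funders` column is added only to show multiple matches being collapsed; neither comes from the original answer.

```r
library(XML)

# Trimmed-down stand-in for the sample file; the first activity has no budget value
txt <- '<iati-activities>
  <iati-activity>
    <iati-identifier>NL-KVK-41160054-100530</iati-identifier>
    <title>Improvement of basic health care</title>
    <participating-org role="Funding">EU</participating-org>
    <participating-org role="Funding">Cordaid Memisa</participating-org>
    <participating-org role="Implementing">CORDAID RCA</participating-org>
    <budget type="1"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>NL-KVK-41160054-100625</iati-identifier>
    <title>Pigs for Pencils</title>
    <participating-org role="Funding">Dutch Ministry of Foreign Affairs</participating-org>
    <participating-org role="Implementing">PREDA Foundation Inc.</participating-org>
    <budget type="2"><value value-date="2010-04-20">12500</value></budget>
  </iati-activity>
</iati-activities>'

doc   <- xmlParse(txt, asText = TRUE)
nodes <- getNodeSet(doc, "//iati-activity")

# Hypothetical helper: NA for no match, comma-separated string for several
xpath_value <- function(x, path, fun = xmlValue, ...) {
  y <- unlist(xpathSApply(x, path, fun, ...))
  if (length(y) == 0) return(NA_character_)
  paste(y, collapse = ", ")
}

df <- data.frame(
  UniqueID    = sapply(nodes, xpath_value, "./iati-identifier"),
  GrantTitle  = sapply(nodes, xpath_value, "./title"),
  # Convert the text amounts to numbers; a missing value stays NA
  GrantAmount = as.numeric(sapply(nodes, xpath_value, "./budget/value")),
  Recipient   = sapply(nodes, xpath_value, "./participating-org[@role='Implementing']"),
  Funders     = sapply(nodes, xpath_value, "./participating-org[@role='Funding']"),
  stringsAsFactors = FALSE
)
```

Note that as.numeric would turn a collapsed string such as "100, 200" into NA with a warning, so this only makes sense for paths that match at most one node per activity.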
