R dataframe from XML when values are multiple or missing
Problem description
This question is similar to a previous question, Import all fields (and subfields) of XML as dataframe, but I want to pull out only a subset of the XML data and want to include missing/multiple values.
I start with an XML file and want to construct a dataframe in R based on some of the data it contains, defined by the contents of XML elements. It is easiest to explain with an example. In the example below, I want to pick out the information about landmarks for every city (even if there is no landmark element or there are several) and ignore the information about stations.
<world>
  <city>
    <name>London</name>
    <buildings>
      <building>
        <type>landmark</type>
        <bname>Tower Bridge</bname>
      </building>
      <building>
        <type>station</type>
        <bname>Waterloo</bname>
      </building>
    </buildings>
  </city>
  <city>
    <name>New York</name>
    <buildings>
      <building>
        <type>station</type>
        <bname>Grand Central</bname>
      </building>
    </buildings>
  </city>
  <city>
    <name>Paris</name>
    <buildings>
      <building>
        <type>landmark</type>
        <bname>Eiffel Tower</bname>
      </building>
      <building>
        <type>landmark</type>
        <bname>Louvre</bname>
      </building>
    </buildings>
  </city>
</world>
Ideally this would go into a dataframe that looks something like this:
London     Tower Bridge
New York   NA
Paris      Eiffel Tower
Paris      Louvre
I assumed there might be a way to do this using the XML library and xpathSApply, but I think I'm beaten.
I also couldn't think how to phrase the question without just referring to the example, so feel free to edit it to give a more descriptive question.
Recommended answer
Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting the city name and the bname of any associated landmarks:
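library(XML)

# Parse the file into an internal document so that XPath queries can be run on it
doc <- xmlParse("world.xml", useInternalNodes = TRUE)

# Build one small data frame per city, then stack them with rbind
do.call(rbind, xpathApply(doc, "/world/city", function(node) {
  city <- xmlValue(node[["name"]])
  # bname of every building in this city whose type is "landmark"
  xp <- "./buildings/building[./type/text()='landmark']/bname"
  landmark <- xpathSApply(node, xp, xmlValue)
  # An empty result means no landmarks; use NA so the city still gets a row
  if (length(landmark) == 0) landmark <- NA
  data.frame(city, landmark, stringsAsFactors = FALSE)
}))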
The result is:
      city     landmark
1   London Tower Bridge
2 New York         <NA>
3    Paris Eiffel Tower
4    Paris       Louvre
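To try the approach without creating world.xml, the same idea can be sketched with the XML supplied as a string. This is only an illustration under a few assumptions: the helper name buildings_by_city and its type argument are hypothetical (they are not part of the XML package), the sample data is a compact copy of the example above, and the empty-result check uses length() so that it works whether a query with no matches comes back as NULL or as a zero-length list.

```r
library(XML)

# Compact inline copy of the example data, so no world.xml file is needed
xml_string <- "<world>
  <city><name>London</name><buildings>
    <building><type>landmark</type><bname>Tower Bridge</bname></building>
    <building><type>station</type><bname>Waterloo</bname></building>
  </buildings></city>
  <city><name>New York</name><buildings>
    <building><type>station</type><bname>Grand Central</bname></building>
  </buildings></city>
  <city><name>Paris</name><buildings>
    <building><type>landmark</type><bname>Eiffel Tower</bname></building>
    <building><type>landmark</type><bname>Louvre</bname></building>
  </buildings></city>
</world>"

# Hypothetical helper: one row per (city, building name) for the given
# building type, with a single NA row for cities that have no such building
buildings_by_city <- function(doc, type = "landmark") {
  # XPath relative to a <city> node; `type` is pasted in as-is, so this
  # sketch assumes it contains no quote characters
  xp <- sprintf("./buildings/building[type='%s']/bname", type)
  rows <- xpathApply(doc, "/world/city", function(node) {
    city <- xmlValue(node[["name"]])
    bname <- unlist(xpathSApply(node, xp, xmlValue))
    # Zero matches: keep the city, mark the building name as missing
    if (length(bname) == 0) bname <- NA_character_
    data.frame(city, bname, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

doc <- xmlParse(xml_string, asText = TRUE)
result <- buildings_by_city(doc)
print(result)
```

Calling buildings_by_city(doc, "station") applies the same logic to stations, in which case Paris is the city that gets the NA row.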