R dataframe from XML when values are multiple or missing

Question

This question is similar to a previous question, "Import all fields (and subfields) of XML as dataframe", but I want to pull out only a subset of the XML data and to include missing or multiple values.

I start with an XML file and want to construct a dataframe in R from some of the data it contains, selected by the contents of XML elements. It is easiest to explain with an example. In the XML below, I want to pick out the landmark information for every city (even if a city has no landmark element, or has several) and ignore the information about stations.

<world>
    <city>
        <name>London</name>
        <buildings>
            <building>
                <type>landmark</type>
                <bname>Tower Bridge</bname>
            </building>
            <building>
                <type>station</type>
                <bname>Waterloo</bname>
            </building>
        </buildings>
    </city>
    <city>
        <name>New York</name>
        <buildings>
            <building>
                <type>station</type>
                <bname>Grand Central</bname>
            </building>
        </buildings>
    </city>
    <city>
        <name>Paris</name>
        <buildings>
            <building>
                <type>landmark</type>
                <bname>Eiffel Tower</bname>
            </building>
            <building>
                <type>landmark</type>
                <bname>Louvre</bname>
            </building>
        </buildings>
    </city>
</world>

Ideally this would go into a dataframe that looks something like this:

 London      Tower Bridge
 New York    NA
 Paris       Eiffel Tower
 Paris       Louvre

I assumed there might be a way to do this using the XML library and xpathSApply, but I think I'm beaten.

I also couldn't think of how to phrase the question without referring to the example, so feel free to edit it to give a more descriptive title.

Recommended answer

Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting each city's name and the bname of any associated landmarks:

library(XML)
doc <- xmlParse("world.xml", useInternalNodes = TRUE)

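# Build one small data frame per city, then stack them with rbind
do.call(rbind, xpathApply(doc, "/world/city", function(node) {

   city <- xmlValue(node[["name"]])

   # Relative XPath: only buildings of this city whose type is 'landmark'
   xp <- "./buildings/building[./type/text()='landmark']/bname"
   landmark <- xpathSApply(node, xp, xmlValue)

   # A city with no landmark gives a zero-length result; use NA so the
   # city still contributes one row (length() also covers NULL)
   if (length(landmark) == 0) landmark <- NA

   data.frame(city, landmark, stringsAsFactors = FALSE)

}))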

The result is:

      city     landmark
1   London Tower Bridge
2 New York         <NA>
3    Paris Eiffel Tower
4    Paris       Louvre
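
To try the same approach without creating world.xml, the sketch below parses the example document from an inline string. The names xml_text and landmarks_by_city are hypothetical, introduced only for this illustration, and the empty-match check uses length() on the assumption that a city without landmarks yields a zero-length result.

```r
library(XML)

# Inline copy of the example document (stands in for world.xml)
xml_text <- "<world>
  <city><name>London</name><buildings>
    <building><type>landmark</type><bname>Tower Bridge</bname></building>
    <building><type>station</type><bname>Waterloo</bname></building>
  </buildings></city>
  <city><name>New York</name><buildings>
    <building><type>station</type><bname>Grand Central</bname></building>
  </buildings></city>
  <city><name>Paris</name><buildings>
    <building><type>landmark</type><bname>Eiffel Tower</bname></building>
    <building><type>landmark</type><bname>Louvre</bname></building>
  </buildings></city>
</world>"

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: one row per (city, landmark), NA when a city has none
landmarks_by_city <- function(doc) {
  rows <- xpathApply(doc, "/world/city", function(node) {
    city <- xmlValue(node[["name"]])
    xp <- "./buildings/building[./type/text()='landmark']/bname"
    landmark <- xpathSApply(node, xp, xmlValue)
    # Zero matches: keep the city with a missing landmark
    if (length(landmark) == 0) landmark <- NA_character_
    data.frame(city, landmark, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

result <- landmarks_by_city(doc)
print(result)
```

Because data.frame() recycles the single city name against however many landmarks were found, a city with several landmarks produces several rows, and a city with none produces one row with NA.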
