R dataframe from XML when values are multiple or missing

Problem description

This question is similar to a previous question, Import all fields (and subfields) of XML as dataframe, but I want to pull out only a subset of the XML data and want to include missing/multiple values.

I start with an XML file and want to construct a dataframe in R from some of the data it contains, selected by the contents of XML elements. It is easiest to explain with an example. In the example below, I want to pick out the information about landmarks for every city (even if there is no landmark element or there are several) and ignore the information about stations.

<world>
    <city>
        <name>London</name>
        <buildings>
            <building>
                <type>landmark</type>
                <bname>Tower Bridge</bname>
            </building>
            <building>
                <type>station</type>
                <bname>Waterloo</bname>
            </building>
        </buildings>
    </city>
    <city>
        <name>New York</name>
        <buildings>
            <building>
                <type>station</type>
                <bname>Grand Central</bname>
            </building>
        </buildings>
    </city>
    <city>
        <name>Paris</name>
        <buildings>
            <building>
                <type>landmark</type>
                <bname>Eiffel Tower</bname>
            </building>
            <building>
                <type>landmark</type>
                <bname>Louvre</bname>
            </building>
        </buildings>
    </city>
</world>

Ideally this would go into a dataframe that looks something like this:

 London      Tower Bridge
 New York    NA
 Paris       Eiffel Tower
 Paris       Louvre

I assumed there might be a way to do this using the XML library and xpathSApply but I think I'm beaten.

Also couldn't think how to phrase the question without just referring to the example so feel free to edit to give a more descriptive question.

Solution

Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting the city name and the bname of any associated landmarks:

library(XML)
doc <- xmlParse("world.xml", useInternalNodes = TRUE)

do.call(rbind, xpathApply(doc, "/world/city", function(node) {

   city <- xmlValue(node[["name"]])

   xp <- "./buildings/building[./type/text()='landmark']/bname"
   landmark <- xpathSApply(node, xp, xmlValue)
   # A length check covers both NULL and an empty result when no landmark matches
   if (length(landmark) == 0) landmark <- NA_character_

   data.frame(city, landmark, stringsAsFactors = FALSE)

}))

The result is:

      city     landmark
1   London Tower Bridge
2 New York         <NA>
3    Paris Eiffel Tower
4    Paris       Louvre
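
For anyone who wants to try this without creating world.xml first, here is a minimal self-contained sketch of the same approach. It parses the sample document from an inline string via `xmlParse(..., asText = TRUE)`. The names `world_xml` and `city_landmarks` are made up for illustration and are not part of the XML package. It also tests `length(landmark) == 0` instead of `is.null()`, on the assumption that an XPath query with no matches may come back as an empty result rather than NULL.

```r
library(XML)

# Inline copy of the sample document from the question (illustrative variable name).
world_xml <- '<world>
  <city>
    <name>London</name>
    <buildings>
      <building><type>landmark</type><bname>Tower Bridge</bname></building>
      <building><type>station</type><bname>Waterloo</bname></building>
    </buildings>
  </city>
  <city>
    <name>New York</name>
    <buildings>
      <building><type>station</type><bname>Grand Central</bname></building>
    </buildings>
  </city>
  <city>
    <name>Paris</name>
    <buildings>
      <building><type>landmark</type><bname>Eiffel Tower</bname></building>
      <building><type>landmark</type><bname>Louvre</bname></building>
    </buildings>
  </city>
</world>'

# Parse from the string rather than from a file.
doc <- xmlParse(world_xml, asText = TRUE)

# Hypothetical helper: builds one small data frame per <city> node,
# with one row per landmark, or a single NA row when there is none.
city_landmarks <- function(node) {
  city <- xmlValue(node[["name"]])

  # Relative XPath: only buildings whose <type> is 'landmark'.
  xp <- "./buildings/building[./type/text()='landmark']/bname"
  landmark <- unlist(xpathApply(node, xp, xmlValue))

  # unlist() of an empty list is NULL, so length 0 means "no landmark".
  if (length(landmark) == 0) landmark <- NA_character_

  data.frame(city, landmark, stringsAsFactors = FALSE)
}

# Stack the per-city data frames into one.
result <- do.call(rbind, xpathApply(doc, "/world/city", city_landmarks))
print(result)
```

Because `data.frame()` recycles the single city name across however many landmarks were found, a city with several landmarks yields several rows, and a city with none yields one row holding NA.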
