R dataframe from XML when values are multiple or missing
Question
This question is similar to a previous question, Import all fields (and subfields) of XML as dataframe, but I want to pull out only a subset of the XML data and want to include missing/multiple values.
I start with an XML file and want to construct a dataframe in R from some of the data it contains, selected by the contents of XML elements. It is easiest to explain with an example. In the example below, I want to pick out the landmark information for every city (even if a city has no landmark element, or has several) and ignore the information about stations.
    <world>
      <city>
        <name>London</name>
        <buildings>
          <building>
            <type>landmark</type>
            <bname>Tower Bridge</bname>
          </building>
          <building>
            <type>station</type>
            <bname>Waterloo</bname>
          </building>
        </buildings>
      </city>
      <city>
        <name>New York</name>
        <buildings>
          <building>
            <type>station</type>
            <bname>Grand Central</bname>
          </building>
        </buildings>
      </city>
      <city>
        <name>Paris</name>
        <buildings>
          <building>
            <type>landmark</type>
            <bname>Eiffel Tower</bname>
          </building>
          <building>
            <type>landmark</type>
            <bname>Louvre</bname>
          </building>
        </buildings>
      </city>
    </world>
Ideally this would go into a dataframe that looks something like this:
    London    Tower Bridge
    New York  NA
    Paris     Eiffel Tower
    Paris     Louvre
I assumed there might be a way to do this using the XML library and xpathSApply, but I think I'm beaten.
I also couldn't think how to phrase the question without just referring to the example, so feel free to edit it to make it more descriptive.
Solution
Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting the city name and the bname of any associated landmarks:

    library(XML)

    doc <- xmlParse("world.xml", useInternalNodes = TRUE)

    do.call(rbind, xpathApply(doc, "/world/city", function(node) {
      # Name of the current city
      city <- xmlValue(node[["name"]])
      # bname of every building under this city whose type is 'landmark'
      xp <- "./buildings/building[./type/text()='landmark']/bname"
      landmark <- xpathSApply(node, xp, xmlValue)
      # No landmark found: keep the city with NA
      # (the length test covers both NULL and an empty list)
      if (length(landmark) == 0) landmark <- NA
      # city is recycled when there are several landmarks
      data.frame(city, landmark, stringsAsFactors = FALSE)
    }))
The result is:
          city     landmark
    1   London Tower Bridge
    2 New York         <NA>
    3    Paris Eiffel Tower
    4    Paris       Louvre
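To try the approach without creating world.xml first, here is a minimal self-contained sketch that parses the example from an inline string. The variable xml_text and the helper city_landmarks are illustrative names, not part of the XML package, and the empty-match test uses length() on the assumption that a query with no hits may come back as either NULL or an empty list.

```r
library(XML)

# Inline copy of the example document (assumption: same content as world.xml).
xml_text <- "<world>
  <city><name>London</name><buildings>
    <building><type>landmark</type><bname>Tower Bridge</bname></building>
    <building><type>station</type><bname>Waterloo</bname></building>
  </buildings></city>
  <city><name>New York</name><buildings>
    <building><type>station</type><bname>Grand Central</bname></building>
  </buildings></city>
  <city><name>Paris</name><buildings>
    <building><type>landmark</type><bname>Eiffel Tower</bname></building>
    <building><type>landmark</type><bname>Louvre</bname></building>
  </buildings></city>
</world>"

# asText = TRUE tells xmlParse that the argument is XML content, not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: turn one <city> node into one or more data frame rows.
city_landmarks <- function(node) {
  city <- xmlValue(node[["name"]])
  # Relative XPath: bname of each building in this city typed as a landmark.
  xp <- "./buildings/building[type='landmark']/bname"
  # unlist() flattens the matches to a character vector (NULL if none).
  landmark <- unlist(xpathApply(node, xp, xmlValue))
  # A city without landmarks still yields one row, with NA as the landmark.
  if (length(landmark) == 0) landmark <- NA_character_
  # city has length 1, so data.frame() recycles it across several landmarks.
  data.frame(city, landmark, stringsAsFactors = FALSE)
}

# One data frame per city, stacked into a single result.
result <- do.call(rbind, xpathApply(doc, "/world/city", city_landmarks))
print(result)
```

Using NA_character_ rather than a plain NA keeps the landmark column a character vector even when the first city processed has no landmarks.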