xml with nested siblings to data frame in R


Problem description

I am new to parsing XML in R. I am trying to parse XML into a workable data frame. I have tried some XPath functions from the XML package but cannot seem to arrive at the correct answer.

Here is my XML:

<ResidentialProperty>
    <Listing>
      <StreetAddress>
        <StreetNumber>11111</StreetNumber>
        <StreetName>111th</StreetName>
        <StreetSuffix>Avenue Ct</StreetSuffix>
        <StateOrProvince>WA</StateOrProvince>
      </StreetAddress>
      <MLSInformation>
        <ListingStatus Status="Active"/>
        <StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
      </MLSInformation>
      <GeographicData>
        <Latitude>11.111111</Latitude>
        <Longitude>-111.111111</Longitude>
        <County>Pierce</County>
      </GeographicData>
      <SchoolData>
        <SchoolDistrict>Puyallup</SchoolDistrict>
      </SchoolData>
      <View>Territorial</View>
    </Listing>
    <YearBuilt>1997</YearBuilt>
    <InteriorFeatures>Bath Off Master,Dbl Pane/Storm Windw</InteriorFeatures>
    <Occupant>
      <Name>Vacant</Name>
    </Occupant>
    <WaterFront/>
    <Roof>Composition</Roof>
    <Exterior>Brick,Cement Planked,Wood,Wood Products</Exterior>
</ResidentialProperty>

When I run:

ResidentialProperty <- xmlToDataFrame(nodes=getNodeSet(doc,"//ResidentialProperty"))

The values of the child nodes within the parent node are compressed into a single string:

11111111thAvenue CtWA2015-07-05T23:48:53.41011.111111-111.111111PiercePuyallupTerritorial

If I move down one node, the same thing happens:

11111111thAvenue CtWA

The values of the child nodes are all pasted together.
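
A minimal sketch of what seems to be going on, assuming the XML package and a shortened, hypothetical two-field version of the address block: xmlValue() on a parent element returns the text of all its descendants joined together, while applying it to each child keeps the values apart.

```r
library(XML)

# Shortened, hypothetical version of the StreetAddress block from the question
xml <- "<StreetAddress><StreetNumber>11111</StreetNumber><StreetName>111th</StreetName></StreetAddress>"
doc <- xmlParse(xml, asText = TRUE)
node <- getNodeSet(doc, "//StreetAddress")[[1]]

# Value of the parent: the text of every descendant, concatenated
pasted <- xmlValue(node)

# Value of each child separately, named after the child element
per_child <- vapply(xmlChildren(node), xmlValue, character(1))

print(pasted)
print(per_child)
```

Here pasted holds the run-together string, and per_child is a named character vector with one entry per child element.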

I also tried a brute-force method, which worked somewhat:

StreetAddress <- xmlToDataFrame(nodes=getNodeSet(doc,"//StreetAddress"))
MLSInformation <- xmlToDataFrame(nodes=getNodeSet(doc,"//MLSInformation"))
GeographicData <- xmlToDataFrame(nodes=getNodeSet(doc,"//GeographicData"))
SchoolData <- xmlToDataFrame(nodes=getNodeSet(doc,"//SchoolData"))
YearBuilt <- xmlToDataFrame(nodes=getNodeSet(doc,"//YearBuilt"))
InteriorFeatures <- xmlToDataFrame(nodes=getNodeSet(doc,"//InteriorFeatures"))
Occupant <- xmlToDataFrame(nodes=getNodeSet(doc,"//Occupant"))
Roof <- xmlToDataFrame(nodes=getNodeSet(doc,"//Roof"))
Exterior <- xmlToDataFrame(nodes=getNodeSet(doc,"//Exterior"))
df <- cbind(StreetAddress, MLSInformation, GeographicData, SchoolData, YearBuilt, InteriorFeatures, Occupant, Roof, Exterior)

but some of the column names were not as expected:

> colnames(df)
 [1] "StreetNumber"     "StreetName"       "StreetSuffix"     "StateOrProvince"  "ListingStatus"   
 [6] "StatusChangeDate" "Latitude"         "Longitude"        "County"           "SchoolDistrict"  
[11] "text"             "text"             "Name"             "text"             "text"    

Columns 11, 12, 14, and 15 should be named "YearBuilt", "InteriorFeatures", "Roof", and "Exterior" respectively. (Side note - why does this happen?)
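
On the side note, a likely explanation, judging from the output above: the columns appear to be named after the children of each selected node, and the only child of a leaf element such as YearBuilt is a text node, whose node name is "text". The sketch below checks this on a one-element example and shows one possible workaround; the variable names are made up for illustration.

```r
library(XML)

doc <- xmlParse("<ResidentialProperty><YearBuilt>1997</YearBuilt></ResidentialProperty>",
                asText = TRUE)
node <- getNodeSet(doc, "//YearBuilt")[[1]]

# The only child of <YearBuilt> is a text node, and its node name is "text"
child_names <- vapply(xmlChildren(node), xmlName, character(1))
print(unname(child_names))

# Possible workaround: build the one-column data frame from the element itself,
# so the column takes the element's own name
year_built <- setNames(data.frame(xmlValue(node), stringsAsFactors = FALSE),
                       xmlName(node))
print(year_built)
```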

I am trying to find a way to sort each atomic value into an appropriate column of a data frame with the column names being the names of the nodes, even within nested children nodes. Also, my data may change over time, so I'm looking for a dynamic function to conform to the data, producing expected results if possible.

I imagine this is a somewhat common XML schema (with layers of nested children), so I am surprised not to find much info on the topic, though I may simply be using the wrong jargon in my searches. My guess is that there is a simple answer. Do you have any suggestions?

Solution

Considering xml holds your example string, here's another strategy for Residential Properties with a varying number of items:

library(XML)
library(plyr) 
# xml <- '<ResidentialProperty>........'
doc <- xmlParse(xml, asText =  TRUE)
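df <- do.call(rbind.fill, lapply(doc['//ResidentialProperty'], function(x) {
  # Names of all nodes under x in document order; text nodes are named "text"
  node_names <- xpathSApply(x, './/.', xmlName)
  # The node just before each text node is the element holding that text
  # (this assumes the text is the first child of its element)
  node_names <- node_names[which(node_names == "text") - 1]
  values <- xpathSApply(x, ".//text()", xmlValue)
  # One-row data frame with a column per element that has text
  return(as.data.frame(t(setNames(values, node_names)), stringsAsFactors = FALSE))
}))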
df
#   StreetNumber StreetName StreetSuffix StateOrProvince        StatusChangeDate  Latitude   Longitude County SchoolDistrict        View YearBuilt                     InteriorFeatures   Name        Roof                                Exterior
# 1        11111      111th    Avenue Ct              WA 2015-07-05T23:48:53.410 11.111111 -111.111111 Pierce       Puyallup Territorial      1997 Bath Off Master,Dbl Pane/Storm Windw Vacant Composition Brick,Cement Planked,Wood,Wood Products
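
As a variation on the approach above (not the answer's own code), the leaf elements can be selected directly with XPath instead of being matched up by position. This is a minimal sketch: leaf_row is a hypothetical helper name, the XML is a shortened version of the example, and elements without text (such as WaterFront or ListingStatus) are skipped, so attribute values are not captured.

```r
library(XML)

# Hypothetical helper: one-row data frame from the leaf elements under `node`
leaf_row <- function(node) {
  # Elements with no child elements and some non-blank text
  leaves <- getNodeSet(node, ".//*[not(*) and normalize-space(text())]")
  values <- vapply(leaves, xmlValue, character(1))
  names(values) <- vapply(leaves, xmlName, character(1))
  as.data.frame(t(values), stringsAsFactors = FALSE)
}

# Shortened version of the example document
xml <- '<ResidentialProperty>
  <Listing>
    <StreetAddress>
      <StreetNumber>11111</StreetNumber>
      <StreetName>111th</StreetName>
    </StreetAddress>
    <MLSInformation>
      <ListingStatus Status="Active"/>
    </MLSInformation>
  </Listing>
  <YearBuilt>1997</YearBuilt>
  <WaterFront/>
</ResidentialProperty>'

doc <- xmlParse(xml, asText = TRUE)
rows <- lapply(getNodeSet(doc, "//ResidentialProperty"), leaf_row)
df <- do.call(rbind, rows)
print(df)
```

With several properties whose fields differ, rbind.fill from plyr, as used above, could replace rbind so that missing columns are filled with NA.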
