XML with nested siblings to data frame in R

Question

I am new to parsing XML in R. I am trying to parse XML into a workable data frame. I have tried some XPath functions from the XML package but cannot seem to arrive at the correct answer.

Here is my XML:

<ResidentialProperty>
    <Listing>
      <StreetAddress>
        <StreetNumber>11111</StreetNumber>
        <StreetName>111th</StreetName>
        <StreetSuffix>Avenue Ct</StreetSuffix>
        <StateOrProvince>WA</StateOrProvince>
      </StreetAddress>
      <MLSInformation>
        <ListingStatus Status="Active"/>
        <StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
      </MLSInformation>
      <GeographicData>
        <Latitude>11.111111</Latitude>
        <Longitude>-111.111111</Longitude>
        <County>Pierce</County>
      </GeographicData>
      <SchoolData>
        <SchoolDistrict>Puyallup</SchoolDistrict>
      </SchoolData>
      <View>Territorial</View>
    </Listing>
    <YearBuilt>1997</YearBuilt>
    <InteriorFeatures>Bath Off Master,Dbl Pane/Storm Windw</InteriorFeatures>
    <Occupant>
      <Name>Vacant</Name>
    </Occupant>
    <WaterFront/>
    <Roof>Composition</Roof>
    <Exterior>Brick,Cement Planked,Wood,Wood Products</Exterior>
</ResidentialProperty>

When I run:

ResidentialProperty <- xmlToDataFrame(nodes=getNodeSet(doc,"//ResidentialProperty"))

The values of the child nodes within the parent node are compressed to:

11111111thAvenue CtWA2015-07-05T23:48:53.41011.111111-111.111111PiercePuyallupTerritorial

If I move down one node, the same thing happens:

11111111thAvenue CtWA

The values of the child nodes are all pasted together.

I also tried a brute-force method, which worked somewhat:

StreetAddress <- xmlToDataFrame(nodes=getNodeSet(doc,"//StreetAddress"))
MLSInformation <- xmlToDataFrame(nodes=getNodeSet(doc,"//MLSInformation"))
GeographicData <- xmlToDataFrame(nodes=getNodeSet(doc,"//GeographicData"))
SchoolData <- xmlToDataFrame(nodes=getNodeSet(doc,"//SchoolData"))
YearBuilt <- xmlToDataFrame(nodes=getNodeSet(doc,"//YearBuilt"))
InteriorFeatures <- xmlToDataFrame(nodes=getNodeSet(doc,"//InteriorFeatures"))
Occupant <- xmlToDataFrame(nodes=getNodeSet(doc,"//Occupant"))
Roof <- xmlToDataFrame(nodes=getNodeSet(doc,"//Roof"))
Exterior <- xmlToDataFrame(nodes=getNodeSet(doc,"//Exterior"))
df <- cbind(StreetAddress, MLSInformation, GeographicData, SchoolData, YearBuilt, InteriorFeatures, Occupant, Roof, Exterior)

But some of the column names were not as expected:

> colnames(df)
 [1] "StreetNumber"     "StreetName"       "StreetSuffix"     "StateOrProvince"  "ListingStatus"   
 [6] "StatusChangeDate" "Latitude"         "Longitude"        "County"           "SchoolDistrict"  
[11] "text"             "text"             "Name"             "text"             "text"    

Columns 11, 12, 14, and 15 should be "YearBuilt", "InteriorFeatures", "Roof", and "Exterior", respectively. (Side note: why does this happen?)

I am trying to find a way to sort each atomic value into the appropriate column of a data frame, with the column names being the names of the nodes, even for nested child nodes. Also, my data may change over time, so I'm looking for a dynamic function that adapts to the data and still produces the expected results, if possible.

I imagine this is a fairly common XML schema (with layers of nested children), so I am surprised not to find much information on the topic, though I may simply be using the wrong jargon in my searches. My guess is that there is a simple answer. Do you have any suggestions?

Recommended answer

Assuming the variable xml holds your example string, here's another strategy that handles ResidentialProperty nodes with a varying number of items:

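library(XML)
library(plyr)  # for rbind.fill()
# xml <- '<ResidentialProperty>........'
doc <- xmlParse(xml, asText = TRUE)
df <- do.call(rbind.fill, lapply(doc['//ResidentialProperty'], function(x) {
  # names of every node under x, in document order (text nodes are named "text")
  names <- xpathSApply(x, './/.', xmlName)
  # for simple leaf elements, the entry just before each "text" is the element containing it
  names <- names[which(names == "text") - 1]
  values <- xpathSApply(x, ".//text()", xmlValue)
  # one-row data frame with one column per element that holds text
  return(as.data.frame(t(setNames(values, names)), stringsAsFactors = FALSE))
}))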
df
#   StreetNumber StreetName StreetSuffix StateOrProvince        StatusChangeDate  Latitude   Longitude County SchoolDistrict        View YearBuilt                     InteriorFeatures   Name        Roof                                Exterior
# 1        11111      111th    Avenue Ct              WA 2015-07-05T23:48:53.410 11.111111 -111.111111 Pierce       Puyallup Territorial      1997 Bath Off Master,Dbl Pane/Storm Windw Vacant Composition Brick,Cement Planked,Wood,Wood Products
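
On the side note in the question: as far as I understand it, xmlToDataFrame treats each node it is given as one row and each direct child of that node as one column. A leaf element such as YearBuilt has only a text node as its child, and the XML package reports the name of a text node as "text". A nested element such as StreetAddress becomes a single column whose value is all of its descendant text concatenated. The following is a minimal sketch of that behaviour on a cut-down copy of the question's XML (the shortened sample and the variable names are my own assumptions, not part of the original post):

```r
library(XML)

# Cut-down sample based on the XML in the question (assumption: shortened for illustration)
xml <- paste0(
  "<ResidentialProperty>",
  "<Listing><StreetAddress>",
  "<StreetNumber>11111</StreetNumber><StreetName>111th</StreetName>",
  "<StreetSuffix>Avenue Ct</StreetSuffix><StateOrProvince>WA</StateOrProvince>",
  "</StreetAddress></Listing>",
  "<YearBuilt>1997</YearBuilt>",
  "</ResidentialProperty>"
)
doc <- xmlParse(xml, asText = TRUE)

# The only child of <YearBuilt> is a text node, and its name is "text"
year_node <- getNodeSet(doc, "//YearBuilt")[[1]]
child_names <- unname(sapply(xmlChildren(year_node), xmlName))
print(child_names)

# xmlValue() on a nested element concatenates the text of all its descendants
address_node <- getNodeSet(doc, "//StreetAddress")[[1]]
pasted <- xmlValue(address_node)
print(pasted)
```

This is also why the answer above works from the text nodes upward: it pairs each text node with the element that contains it instead of relying on the direct children of the selected node. Note that ListingStatus does not appear in the result because its value is stored in the Status attribute rather than in a text node.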
