XML with nested siblings to data frame in R
I am new to parsing XML in R. I am trying to parse XML into a workable data frame. I have tried some XPath functions from the XML package but cannot seem to arrive at the correct answer.
Here is my XML:
<ResidentialProperty>
<Listing>
<StreetAddress>
<StreetNumber>11111</StreetNumber>
<StreetName>111th</StreetName>
<StreetSuffix>Avenue Ct</StreetSuffix>
<StateOrProvince>WA</StateOrProvince>
</StreetAddress>
<MLSInformation>
<ListingStatus Status="Active"/>
<StatusChangeDate>2015-07-05T23:48:53.410</StatusChangeDate>
</MLSInformation>
<GeographicData>
<Latitude>11.111111</Latitude>
<Longitude>-111.111111</Longitude>
<County>Pierce</County>
</GeographicData>
<SchoolData>
<SchoolDistrict>Puyallup</SchoolDistrict>
</SchoolData>
<View>Territorial</View>
</Listing>
<YearBuilt>1997</YearBuilt>
<InteriorFeatures>Bath Off Master,Dbl Pane/Storm Windw</InteriorFeatures>
<Occupant>
<Name>Vacant</Name>
</Occupant>
<WaterFront/>
<Roof>Composition</Roof>
<Exterior>Brick,Cement Planked,Wood,Wood Products</Exterior>
</ResidentialProperty>
When I run:
ResidentialProperty <- xmlToDataFrame(nodes=getNodeSet(doc,"//ResidentialProperty"))
The values of the child nodes within the parent node are compressed to:
11111111thAvenue CtWA2015-07-05T23:48:53.41011.111111-111.111111PiercePuyallupTerritorial
If I move down one node, the same thing happens:
11111111thAvenue CtWA
The values of the child nodes are all pasted together.
I also tried a brute force method which worked somewhat:
StreetAddress <- xmlToDataFrame(nodes=getNodeSet(doc,"//StreetAddress"))
MLSInformation <- xmlToDataFrame(nodes=getNodeSet(doc,"//MLSInformation"))
GeographicData <- xmlToDataFrame(nodes=getNodeSet(doc,"//GeographicData"))
SchoolData <- xmlToDataFrame(nodes=getNodeSet(doc,"//SchoolData"))
YearBuilt <- xmlToDataFrame(nodes=getNodeSet(doc,"//YearBuilt"))
InteriorFeatures <- xmlToDataFrame(nodes=getNodeSet(doc,"//InteriorFeatures"))
Occupant <- xmlToDataFrame(nodes=getNodeSet(doc,"//Occupant"))
Roof <- xmlToDataFrame(nodes=getNodeSet(doc,"//Roof"))
Exterior <- xmlToDataFrame(nodes=getNodeSet(doc,"//Exterior"))
df <- cbind(StreetAddress, MLSInformation, GeographicData, SchoolData, YearBuilt, InteriorFeatures, Occupant, Roof, Exterior)
but some of the column names were not as expected:
> colnames(df)
[1] "StreetNumber" "StreetName" "StreetSuffix" "StateOrProvince" "ListingStatus"
[6] "StatusChangeDate" "Latitude" "Longitude" "County" "SchoolDistrict"
[11] "text" "text" "Name" "text" "text"
colnames(df)[c(11, 12, 14, 15)] should be "YearBuilt", "InteriorFeatures", "Roof", and "Exterior" respectively. (Side note: why does this happen?)
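On the side note, a likely explanation (offered as a sketch, not a definitive account of the package internals): xmlToDataFrame appears to treat each node it is given as a row and that node's children as the columns. A leaf element such as YearBuilt has only one child, its text node, and text nodes report the name "text". The small document below is a made-up fragment used only to show that naming behaviour.

```r
library(XML)

# Hypothetical minimal fragment, just to inspect child-node names
doc <- xmlParse("<P><YearBuilt>1997</YearBuilt><Occupant><Name>Vacant</Name></Occupant></P>",
                asText = TRUE)

year     <- getNodeSet(doc, "//YearBuilt")[[1]]
occupant <- getNodeSet(doc, "//Occupant")[[1]]

# Names of the direct children of each node
year_children     <- unname(vapply(xmlChildren(year), xmlName, character(1)))
occupant_children <- unname(vapply(xmlChildren(occupant), xmlName, character(1)))

print(year_children)      # the only child of a leaf element is its text node
print(occupant_children)  # a container element exposes its child element names
```

This would be consistent with Occupant producing a "Name" column while YearBuilt, InteriorFeatures, Roof and Exterior each produce a "text" column.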
I am trying to find a way to sort each atomic value into an appropriate column of a data frame with the column names being the names of the nodes, even within nested children nodes. Also, my data may change over time, so I'm looking for a dynamic function to conform to the data, producing expected results if possible.
I imagine this is a somewhat common XML schema (with layers of nested children), so I am surprised not to find much information on the topic, though I may simply be using the wrong jargon in my searches. My guess is that there is a simple answer. Do you have any suggestions?
Considering xml holds your example string, here's another strategy for Residential Properties with a varying number of items:
library(XML)
library(plyr)
# xml <- '<ResidentialProperty>........'
doc <- xmlParse(xml, asText = TRUE)
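df <- do.call(rbind.fill, lapply(doc['//ResidentialProperty'], function(x) {
  # Names of every node under x, in document order; text nodes are named "text"
  names <- xpathSApply(x, './/.', xmlName)
  # For a leaf element, the node just before its text node is the element itself
  names <- names[which(names == "text") - 1]
  values <- xpathSApply(x, ".//text()", xmlValue)
  # One-row data frame with one column per element name
  return(as.data.frame(t(setNames(values, names)), stringsAsFactors = FALSE))
}))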
df
# StreetNumber StreetName StreetSuffix StateOrProvince StatusChangeDate Latitude Longitude County SchoolDistrict View YearBuilt InteriorFeatures Name Roof Exterior
# 1 11111 111th Avenue Ct WA 2015-07-05T23:48:53.410 11.111111 -111.111111 Pierce Puyallup Territorial 1997 Bath Off Master,Dbl Pane/Storm Windw Vacant Composition Brick,Cement Planked,Wood,Wood Products
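A variant of the same idea, as a minimal sketch rather than the answer's own method: instead of relying on node order, each non-blank text node can be named after its parent element via xmlParent. The helper name flatten_property and the shortened sample XML are assumptions made for illustration. Base rbind is used here because there is a single property; rbind.fill from plyr would still be needed when properties have differing columns.

```r
library(XML)

# Shortened, hypothetical sample based on the question's structure
xml <- '<ResidentialProperty>
  <Listing>
    <StreetAddress>
      <StreetNumber>11111</StreetNumber>
      <StreetName>111th</StreetName>
    </StreetAddress>
  </Listing>
  <YearBuilt>1997</YearBuilt>
  <WaterFront/>
  <Roof>Composition</Roof>
</ResidentialProperty>'

doc <- xmlParse(xml, asText = TRUE)

# Hypothetical helper: one-row data frame with a column per leaf element
flatten_property <- function(node) {
  # Keep only text nodes that contain something other than whitespace
  leaves <- getNodeSet(node, ".//text()[normalize-space(.) != '']")
  values <- vapply(leaves, xmlValue, character(1))
  # Name each value after the element that encloses the text node
  names(values) <- vapply(leaves, function(n) xmlName(xmlParent(n)), character(1))
  as.data.frame(t(values), stringsAsFactors = FALSE)
}

rows <- lapply(getNodeSet(doc, "//ResidentialProperty"), flatten_property)
df <- do.call(rbind, rows)
print(df)
```

As in the answer above, empty elements such as WaterFront and attribute-only elements such as ListingStatus produce no column, because they contain no text node.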