Convert (possibly malformed) XML into a data frame in R
Problem description
I am trying to convert an xml file from the US Federal Register archive into a data frame where each row corresponds to a particular action (e.g., Notice, Rule, Proposed Rule) and each column contains an attribute related to that action (e.g., Agency Type, Subject, etc). I have tried the following:
library(XML)
library(plyr)
setwd("C:/Users/mwilliamson/Desktop/FedReg/2000/01/")
url <- "FR-2000-01-18.xml"
doc <- xmlInternalTreeParse(url)
doc_list <- xmlToList(doc)
j <- ldply(doc_list, data.frame)
However, it returns an error:
Error in data.frame(SECTNO = "§ 831.502", SUBJECT = "Automatic separation; exemption.",  :
  arguments imply differing number of rows: 1, 0
It appears that the number of blank values and the differing lengths of the variables create a problem as R processes the XML (I may be wrong here; I do not have much experience with the XML package). I thought it might be possible to use the schema (.xsd) file to avoid this, but it is not clear how to use the schema with xmlToList. Essentially, I am looking for the "best" way to process the XML into the data frame I described and fill any blank cells with NA. I have uploaded the schema and a sample file to:
https://www.dropbox.com/sh/pluje12t185w1v2/ys1xHzilQO
Any help you can provide would be great!!
UPDATE: I have also tried:
xmlToDataFrame(doc, colClasses = character, homogeneous = NA)
but received the following message:
Error: duplicate subscripts for columns
Again, many thanks for any help you might offer.
UPDATE: It appears that the /AGENCY node is where the data begins to actually fit the format I am attempting to create; however, I can't seem to extract all of the rest of the data (i.e., I can get a single column with 115 records identifying the agency, but can't get the rest of the information related to those 115 records). I have tried the following:
out <- getNodeSet(doc, "//*", fun=xmlToList)
df <- data.frame(do.call(rbind, out))
head(df)
but it seems to cause R to crash. I am hoping that my continued updates will inspire someone to lend a hand. Thanks again for any help you can give.
Recommended answer
This XML is a mess and my guess is that you'll need to parse each action separately.
table(xpathSApply(doc, "//FEDREG/child::node()", xmlName))
    DATE  NEWPART       NO  NOTICES PRESDOCS PRORULES    RULES UNITNAME      VOL
      12        6       12        1        3        1        1       12       12
table(xpathSApply(doc, "//NOTICES/child::node()", xmlName))
NOTICE
92
Get notices using getNodeSet
z <- getNodeSet(doc, "//NOTICE")
z[[1]]
# check node names
sapply(z, xmlSApply, xmlName)
x <- xmlToDataFrame(z)
dim(x)
[1] 92 4
So this mashes together lots of details from PREAMB and SUPLINF, and you may need to parse those nodes separately.
If you just take PREAMB, that's also a mess...
z2 <- getNodeSet(doc, "//NOTICE/PREAMB")
# check node names and notice different formats
sapply(z2, xmlSApply, xmlName)
## and count
sort( table(unlist(sapply(z2, xmlSApply, xmlName))) )
    AUTH   BILCOD     NOTE GPOTABLE    STARS  PRTPAGE     DATE     FTNT      GPH
       2        3        3        5        5        8       15       15       15
 EFFDATE      ADD    DATES       FP      SIG   DEPDOC  EXTRACT      SUM      AGY
      16       19       24       32       37       45       47       52       54
  FURINF   SUBAGY      ACT   AGENCY  SUBJECT       HD        P
      54       55       57       92       92      103      663
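One way to check how many distinct layouts the PREAMB nodes use is to build a signature from each node's child tag names and tabulate it. The sketch below is not part of the original answer. It runs on a small made-up document (`toy`), uses a hypothetical helper name (`layout_of`), and assumes the XML package is installed.

```r
library(XML)

# Made-up document: two PREAMB nodes share a layout, the third differs
toy <- xmlParse(paste0(
  "<NOTICES>",
  "<NOTICE><PREAMB><AGENCY>A</AGENCY><SUBJECT>S1</SUBJECT><ACT>x</ACT></PREAMB></NOTICE>",
  "<NOTICE><PREAMB><AGENCY>B</AGENCY><SUBJECT>S2</SUBJECT><ACT>y</ACT></PREAMB></NOTICE>",
  "<NOTICE><PREAMB><AGENCY>C</AGENCY><SUBJECT>S3</SUBJECT><HD>h</HD><P>p</P></PREAMB></NOTICE>",
  "</NOTICES>"), asText = TRUE)

preambs <- getNodeSet(toy, "//NOTICE/PREAMB")

# Hypothetical helper: sorted, de-duplicated child tag names as one string
layout_of <- function(node) {
  paste(sort(unique(unlist(xmlSApply(node, xmlName)))), collapse = ",")
}

layouts <- vapply(preambs, layout_of, character(1))
table(layouts)                                # number of nodes per layout
groups <- split(seq_along(preambs), layouts)  # node indices per layout
```

Each group of indices could then be handed to xmlToDataFrame or to a parser written for that layout.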
I see three different formats here, so xmlToDataFrame will work with some nodes but not all.
x <- xmlToDataFrame(z2[1:4])
Compare these 10 columns to the results from ldply in your code:
doc_list <- getNodeSet(doc, "//NOTICE/PREAMB", fun=xmlToList)
## this returns 31 columns since it grabs every child node...
j <- ldply(doc_list[1:4], data.frame)
names(j)
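The "differing number of rows: 1, 0" message from the question can be reproduced in base R whenever one field of a record has length zero. Whether that is exactly what xmlToList produces for the Federal Register file is an assumption. The record below is made up, and `fill_na` is a hypothetical helper name.

```r
# Made-up record resembling one parsed node with an empty field
rec <- list(SECTNO  = "831.502",
            SUBJECT = "Automatic separation; exemption.",
            NOTE    = character(0))

# data.frame() refuses to combine columns of length 1 and length 0
err <- tryCatch(data.frame(rec), error = function(e) conditionMessage(e))

# Hypothetical helper: replace zero-length fields with NA before conversion
fill_na <- function(x) {
  lapply(x, function(v) if (length(v) == 0) NA_character_ else v)
}

ok <- data.frame(fill_na(rec), stringsAsFactors = FALSE)
```

After the replacement, every field has length one, so the record converts to a single row with NA in the empty column.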
I think it is sometimes better to just loop through the getNodeSet results and parse what you need, making sure to add NA if a node is not present (using the xp function here). See ?getNodeSet on creating sub-documents and on fixing the memory leak using free. Something like this may work for the most common format; you could add checks and grab additional columns for notices with lots of HD, EXTRACT and P tags.
xp <- function(doc, tag) {
  n <- xpathSApply(doc, tag, xmlValue)
  if (length(n) > 0)
    # paste multiple values? BILCOD and probably others..
    paste0(n, collapse = "; ")
  else NA
}
z <- getNodeSet(doc, "//NOTICE")
n <- length(z)
notices <- vector("list", n)
for (i in seq_len(n)) {
  # copy the node into its own document so "//" paths only search this notice
  z2 <- xmlDoc(z[[i]])
  notices[[i]] <- data.frame(
    AGENCY  = xp(z2, "//AGENCY"),
    SUBAGY  = xp(z2, "//SUBAGY"),
    SUBJECT = xp(z2, "//PREAMB/SUBJECT"),  ## SUBJECT node in SECTION too, so it helps to be as specific as possible
    ACT     = xp(z2, "//ACT"),
    SUM     = xp(z2, "//SUM"),
    DATES   = xp(z2, "//DATES"),
    ADD     = xp(z2, "//ADD"),
    FURINF  = xp(z2, "//FURINF"),
    SIG     = xp(z2, "//PREAMB/SIG"),      ## SIG in SUPLINF too
    SUPLINF = xp(z2, "//SUPLINF"),
    FRDOC   = xp(z2, "//FRDOC"),
    BILCOD  = xp(z2, "//BILCOD"),
    DEPDOC  = xp(z2, "//DEPDOC"),
    PRTPAGE = xp(z2, "//PRTPAGE"),
    stringsAsFactors = FALSE)
  free(z2)  # release the sub-document
}
x <- do.call("rbind", notices)
head(x)
table(is.na(x$ACT) )
FALSE TRUE
57 35
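To try the same pattern without the full Federal Register file, here is a self-contained sketch on a made-up two-notice document in which the second notice has no ACT node. It is a sketch rather than the answer's exact code: `toy`, `xp_chr` and `notice_row` are hypothetical names, and `xp_chr` returns NA_character_ instead of NA so that every column stays character.

```r
library(XML)

# Made-up document; the second NOTICE lacks an ACT node
toy <- xmlParse(paste0(
  "<FEDREG><NOTICES>",
  "<NOTICE><PREAMB><AGENCY>Agency A</AGENCY><SUBJECT>Subject 1</SUBJECT>",
  "<ACT>Notice.</ACT></PREAMB></NOTICE>",
  "<NOTICE><PREAMB><AGENCY>Agency B</AGENCY><SUBJECT>Subject 2</SUBJECT>",
  "</PREAMB></NOTICE>",
  "</NOTICES></FEDREG>"), asText = TRUE)

# Variant of xp: collapse all matches into one string, or NA if none
xp_chr <- function(doc, tag) {
  n <- xpathSApply(doc, tag, xmlValue)
  if (length(n) > 0) paste0(n, collapse = "; ") else NA_character_
}

# Build one row per notice from a temporary sub-document
notice_row <- function(node) {
  sub <- xmlDoc(node)
  row <- data.frame(
    AGENCY  = xp_chr(sub, "//AGENCY"),
    SUBJECT = xp_chr(sub, "//PREAMB/SUBJECT"),
    ACT     = xp_chr(sub, "//ACT"),
    stringsAsFactors = FALSE)
  free(sub)  # release the sub-document once the values are extracted
  row
}

toy_df <- do.call(rbind, lapply(getNodeSet(toy, "//NOTICE"), notice_row))
```

The missing ACT shows up as NA in the second row rather than causing an error.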
You still have columns like SUPLINF with lots of structured data mashed together - you could break that up if needed...
table(xpathSApply(doc, "//NOTICE/SUPLINF/child::node()", xmlName))
  AMDPAR APPENDIX     AUTH   BILCOD     DATE  EXTRACT       FP     FTNT
       1        1       10        1        4       10       23       31
     GPH GPOTABLE       HD   LSTSUB        P  PRTPAGE      SIG     text
      10       12      186        1      783        4       52        1
xpathSApply(doc, "//NOTICE/SUPLINF/GPH", xmlValue)
[1] "EN18JA00.000" "EN18JA00.001" "EN18JA00.002" "EN18JA00.003" "EN18JA00.004" "EN18JA00.005" "EN18JA00.006" "EN18JA00.007" "EN18JA00.008" "EN18JA00.009"
## since SIG is in PREAMB and SUPLINF, you may want to parse that separately
xpathSApply(doc, "//NOTICE/SUPLINF/SIG", xmlValue)
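If SUPLINF needs to be broken up, one possible approach is a long table with one row per child element, keyed by the notice it came from. This is only a sketch on made-up data, not something from the original answer. The document `toy`, the NAME child of SIG, and the names `parts` and `suplinf_long` are assumptions for illustration.

```r
library(XML)

# Made-up document with two notices and differently shaped SUPLINF nodes
toy <- xmlParse(paste0(
  "<NOTICES>",
  "<NOTICE><SUPLINF><HD>Background</HD><P>First paragraph.</P>",
  "<SIG><NAME>Jane Doe,</NAME></SIG></SUPLINF></NOTICE>",
  "<NOTICE><SUPLINF><P>Only paragraph.</P></SUPLINF></NOTICE>",
  "</NOTICES>"), asText = TRUE)

notices <- getNodeSet(toy, "//NOTICE")

# One small data frame per notice: tag name and text of each SUPLINF child
parts <- lapply(seq_along(notices), function(i) {
  # relative path, so only this notice's SUPLINF children are selected
  kids <- getNodeSet(notices[[i]], "./SUPLINF/*")
  if (length(kids) == 0) return(NULL)
  data.frame(notice = i,
             tag    = sapply(kids, xmlName),
             text   = sapply(kids, xmlValue),
             stringsAsFactors = FALSE)
})

# Drop notices without SUPLINF children, then stack the rest
suplinf_long <- do.call(rbind, Filter(Negate(is.null), parts))
```

Rows can then be filtered by tag (for example, only SIG or only HD) and joined back to the per-notice data frame through the notice index.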