Convert (possibly malformed) xml into Data Frame in R

Question

I am trying to convert an xml file from the US Federal Register archive into a data frame where each row corresponds to a particular action (e.g., Notice, Rule, Proposed Rule) and each column contains an attribute related to that action (e.g., Agency Type, Subject, etc). I have tried the following:

> library(XML)
> setwd("C:/Users/mwilliamson/Desktop/FedReg/2000/01/")
> url = "FR-2000-01-18.xml"
> doc <- xmlInternalTreeParse(url)
> doc_list <- xmlToList(doc)
> library(plyr)
> j <- ldply(doc_list, data.frame)

However, it returns an error:

Error in data.frame(SECTNO = "§ 831.502", SUBJECT = "Automatic separation;  
exemption.",  : 
arguments imply differing number of rows: 1, 0

It appears that the number of blank values and the differences in the length of the variables are creating an issue as R processes the XML (I may be wrong here; I don't have much experience with the XML package). I thought it might be possible to use the schema (.xsd) file to avoid this, but it is not clear how to use the schema with xmlToList. Essentially, I am looking for the "best" way to process the xml into the data frame I described and fill any blank cells with NA. I have uploaded the schema and a sample file to:

https://www.dropbox.com/sh/pluje12t185w1v2/ys1xHzilQO

Any help you can provide would be great!!

UPDATE: I have also tried:

xmlToDataFrame(doc, colClasses = character, homogeneous = NA)

but received the following message:

Error: duplicate subscripts for columns

Again, many thanks for any help you might offer.

UPDATE: It appears that the /AGENCY node is where the data begins to actually fit the format I am attempting to create; however, I can't seem to extract all of the rest of the data (i.e., I can get a single column with 115 records identifying the agency, but can't get the rest of the information related to those 115 records). I have tried the following:

out <- getNodeSet(doc, "//*", fun=xmlToList)
df <- data.frame(do.call(rbind, out))
head(df)

but it seems to cause R to crash. I am hoping that my continued updates will inspire someone to lend a hand. Thanks again for any help you can give.

Recommended answer

This XML is a mess and my guess is that you'll need to parse each action separately.

table(xpathSApply(doc, "//FEDREG/child::node()", xmlName))
    DATE  NEWPART       NO  NOTICES PRESDOCS PRORULES    RULES UNITNAME      VOL 
      12        6       12        1        3        1        1       12       12 

table(xpathSApply(doc, "//NOTICES/child::node()", xmlName))
   NOTICE 
       92 
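
As a rough, self-contained illustration of the tabulation above, here is the same idea on a tiny made-up document (the XML string and the object name `toy` are assumptions for illustration, not the real Federal Register file). It uses `*` rather than `child::node()`, which restricts the match to element children and so leaves out stray text nodes.

```r
library(XML)

# Hypothetical toy document that only mimics the FEDREG layout
txt <- "<FEDREG>
  <VOL>65</VOL>
  <NOTICES>
    <NOTICE><PREAMB><AGENCY>A</AGENCY></PREAMB></NOTICE>
    <NOTICE><PREAMB><AGENCY>B</AGENCY></PREAMB></NOTICE>
  </NOTICES>
  <RULES><RULE/></RULES>
</FEDREG>"
toy <- xmlParse(txt, asText = TRUE)

# Count the element children at each level to see how the file is organised
top_counts    <- table(xpathSApply(toy, "/FEDREG/*", xmlName))
notice_counts <- table(xpathSApply(toy, "//NOTICES/*", xmlName))

print(top_counts)
print(notice_counts)
```

Running the same two calls at each level of the real file is a cheap way to decide which nodes are regular enough to turn into rows.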

Get notices using getNodeSet

z <- getNodeSet(doc, "//NOTICE")
z[[1]]
# check node names
sapply(z, xmlSApply, xmlName)
x <- xmlToDataFrame(z)
dim(x)
[1] 92  4

So this mashes together lots of details from PREAMB and SUPLINF, and you may need to parse those nodes separately.

If you just take PREAMB, that's also a mess...

z2 <- getNodeSet(doc, "//NOTICE/PREAMB")
# check node names and notice different formats
sapply(z2, xmlSApply, xmlName)
## and count
sort( table(unlist(sapply(z2, xmlSApply, xmlName))) )
    AUTH   BILCOD     NOTE GPOTABLE    STARS  PRTPAGE     DATE     FTNT      GPH  EFFDATE      ADD    DATES       FP      SIG   DEPDOC  EXTRACT      SUM 
       2        3        3        5        5        8       15       15       15       16       19       24       32       37       45       47       52 
     AGY   FURINF   SUBAGY      ACT   AGENCY  SUBJECT       HD        P 
      54       54       55       57       92       92      103      663 

I see three different formats here, so xmlToDataFrame will work with some nodes but not all.

x <- xmlToDataFrame(z2[1:4])

Compare these 10 columns to results from ldply in your code

doc_list <-  getNodeSet(doc, "//NOTICE/PREAMB", fun=xmlToList)
## this returns 31 columns since it grabs every child node...
j <- ldply(doc_list[1:4], data.frame)
names(j)

I think it's sometimes better to just loop through the getNodeSet results and parse what you need, making sure to add NAs if the node is not present (using the xp function here). See ?getNodeSet on creating sub docs and fixing the memory leak using free, but maybe something like this for the most common format. You could add checks and grab additional columns for Notices with lots of HD, EXTRACT and P tags.

xp <- function (doc, tag){
   n <- xpathSApply(doc, tag, xmlValue)
   if (length(n) > 0) 
      # paste multiple values?  BILCOD and probably others..
      paste0(n, collapse="; ") 
   else NA
}
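
To see why the NA fallback matters, here is a small sketch on a made-up notice (the XML string and the names `toy`, `missing_subagy`, `failed` and `row` are assumptions for illustration). A missing node gives a zero-length result, and a zero-length column is enough to make data.frame fail with a "differing number of rows" error, which is probably related to the error in the question.

```r
library(XML)

# Same helper as above, repeated so this snippet runs on its own
xp <- function(doc, tag) {
  n <- xpathSApply(doc, tag, xmlValue)
  if (length(n) > 0) paste0(n, collapse = "; ") else NA
}

# Hypothetical notice: two BILCOD nodes and no SUBAGY node
toy <- xmlParse(
  "<NOTICE><PREAMB><AGENCY>DEPARTMENT OF X</AGENCY></PREAMB><BILCOD>A</BILCOD><BILCOD>B</BILCOD></NOTICE>",
  asText = TRUE)

# A node that is absent yields a zero-length result ...
missing_subagy <- xpathSApply(toy, "//SUBAGY", xmlValue)

# ... and a zero-length column cannot be combined with a length-1 column
failed <- inherits(
  try(data.frame(AGENCY = "DEPARTMENT OF X", SUBAGY = character(0)), silent = TRUE),
  "try-error")

# With xp, every field has exactly one value, so the row always builds
row <- data.frame(
  AGENCY = xp(toy, "//AGENCY"),
  SUBAGY = xp(toy, "//SUBAGY"),   # NA, node is absent
  BILCOD = xp(toy, "//BILCOD"),   # repeated nodes collapsed with "; "
  stringsAsFactors = FALSE)
print(row)
```

Collapsing repeated nodes into one string keeps one row per notice. If you need the individual values later, you can split on "; " or parse that node separately.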


  z <- getNodeSet(doc, "//NOTICE")
  n <-length(z)
  notices <-vector("list",n)
  for(i in seq_len(n))   # seq_len() avoids looping over 1:0 when no NOTICE nodes are found
  {
     z2<-xmlDoc(z[[i]])
     notices[[i]] <- data.frame(
      AGENCY = xp(z2, "//AGENCY"),
      SUBAGY = xp(z2, "//SUBAGY"),
      SUBJECT = xp(z2, "//PREAMB/SUBJECT"),    ##  SUBJECT node in SECTION too, so it helps to be as specific as possible
      ACT= xp(z2, "//ACT"),
      SUM = xp(z2, "//SUM"),
      DATES = xp(z2, "//DATES"),
      ADD = xp(z2, "//ADD"),
      FURINF = xp(z2, "//FURINF"),
      SIG = xp(z2, "//PREAMB/SIG"),     ## SIG in SUPLINF too
      SUPLINF = xp(z2, "//SUPLINF"),
      FRDOC = xp(z2, "//FRDOC"),
      BILCOD = xp(z2, "//BILCOD"),
      DEPDOC = xp(z2, "//DEPDOC"),
      PRTPAGE = xp(z2, "//PRTPAGE"),
       stringsAsFactors=FALSE)
     free(z2)  
  }
  x <- do.call("rbind", notices)
  head(x)
  table(is.na(x$ACT) )
  FALSE  TRUE 
     57    35 

You still have columns like SUPLINF with lots of structured data mashed together - you could break that up if needed...

table(xpathSApply(doc, "//NOTICE/SUPLINF/child::node()", xmlName))

AMDPAR APPENDIX     AUTH   BILCOD     DATE  EXTRACT       FP     FTNT      GPH GPOTABLE       HD   LSTSUB        P  PRTPAGE      SIG     text 
     1        1       10        1        4       10       23       31       10       12      186        1      783        4       52        1 

xpathSApply(doc, "//NOTICE/SUPLINF/GPH", xmlValue)
[1] "EN18JA00.000" "EN18JA00.001" "EN18JA00.002" "EN18JA00.003" "EN18JA00.004" "EN18JA00.005" "EN18JA00.006" "EN18JA00.007" "EN18JA00.008" "EN18JA00.009"
 ## since SIG is in PREAMB and SUPLINF, you may want to parse that separately
 xpathSApply(doc, "//NOTICE/SUPLINF/SIG", xmlValue) 
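
One possible way to break SUPLINF up per notice is sketched below on made-up data (the XML string, the helper `collapse_or_na` and the column names are assumptions for illustration, not part of the original file or of the XML package). It keeps the PREAMB and SUPLINF signatures in separate columns and counts paragraphs instead of pasting them together.

```r
library(XML)

# Hypothetical toy notices; the second one has no SUPLINF at all
txt <- "<NOTICES>
  <NOTICE>
    <PREAMB><AGENCY>A</AGENCY><SIG>Pre Signer</SIG></PREAMB>
    <SUPLINF><HD>Background</HD><P>First.</P><P>Second.</P><SIG>Sup Signer</SIG></SUPLINF>
  </NOTICE>
  <NOTICE>
    <PREAMB><AGENCY>B</AGENCY></PREAMB>
  </NOTICE>
</NOTICES>"
toy <- xmlParse(txt, asText = TRUE)

# Collapse all matches into one string, or return NA when nothing matches
collapse_or_na <- function(doc, path) {
  v <- xpathSApply(doc, path, xmlValue)
  if (length(v) > 0) paste(v, collapse = "; ") else NA_character_
}

rows <- lapply(getNodeSet(toy, "//NOTICE"), function(node) {
  sub <- xmlDoc(node)     # standalone copy, so "//" only searches this notice
  on.exit(free(sub))      # release the copy when the function returns
  data.frame(
    AGENCY      = collapse_or_na(sub, "//PREAMB/AGENCY"),
    PREAMB_SIG  = collapse_or_na(sub, "//PREAMB/SIG"),
    SUPLINF_SIG = collapse_or_na(sub, "//SUPLINF/SIG"),
    SUPLINF_HD  = collapse_or_na(sub, "//SUPLINF/HD"),
    N_PARAS     = length(getNodeSet(sub, "//SUPLINF/P")),
    stringsAsFactors = FALSE)
})
sup <- do.call(rbind, rows)
print(sup)
```

Which SUPLINF children deserve their own column depends on what you plan to analyse, so treat the chosen fields as placeholders.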
