将大型复杂的 XML 文件解析为 data.frame [英] Parsing large and complicated XML file to data.frame

查看:19
本文介绍了将大型复杂的 XML 文件解析为 data.frame的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以,我有包含大量报告的大型 XML 文件.我在下面创建了数据示例以大致显示 xml 的大小及其结构:

So, I have large XML file with lots of reports. I created data example below to approximately show the size of xml and its structure:

x <- "<Report><Agreements><AgreementList /></Agreements><CIP><RecordList><Record><Date>2017-05-26T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description>local</Description></Reason></ReasonsList><Score>xxx</Score></Record><Record><Date>2017-04-30T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description/></Reason></ReasonsList><Score>xyx</Score></Record></RecordList></CIP><Individual><Contact><Email/></Contact><General><FirstName>MM</FirstName></General></Individual><Inquiries><InquiryList><Inquiry><DateOfInquiry>2017-03-19</DateOfInquiry><Reason>cc</Reason></Inquiry><Inquiry><DateOfInquiry>2016-10-14</DateOfInquiry><Reason>er</Reason></Inquiry></InquiryList><Summary><NumberOfInquiries>2</NumberOfInquiries></Summary></Inquiries></Report>"

x <- paste(rep(x, 1.5e+5), collapse = "")
x <- paste0("<R>", x, "</R>")
require(XML)
p <- xmlParse(x)
p <- xmlRoot(p)
p[[1]]

我想将此数据转换为 data.frame,但 XML 的结构并不简单.以前使用 XML,我创建了循环,每个报告都将其子节点转换为 data.frame,但在这里(在此数据中)子节点计数大于 30(没有将所有这些都放在示例中),并且结构不同(列表节点甚至可以在 XML 中出现 2 层).

I would like to transform this data to data.frame, but the structure of XML isn't straightforward. Previously working with XMLs I created loop that for every report transforms its sub nodes to data.frame, but here (in this data) the sub node count is greater than 30 (didn't put all of them in the example), and the structure differs (List nodes can occur even 2 levels deep in XML).

所以我有几个问题:

1) 我确信循环报告并不是处理此问题的最佳方式.我应该如何解决这个问题?

1) I am sure that looping over reports isn't the best way to handle this. How should I approach this problem?

2)我可以以某种方式提取一份报告的所有数据,两行data.frame(可能递归)?

2) Can I somehow extract all the data of one report two one line of data.frame (recursively maybe)?

3) 或者我可以为 XML 的每个列表对象自动创建单独的 data.frames 吗?

3) Or can I automatically create separate data.frames for each list object of XML?

任何帮助将不胜感激.

结果示例如下所示:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  17 variables:
 $ Record.1.Date                : chr "2017-05-26T00:00:00"
 $ Record.1.Grade               : num 2
 $ Record.1.Reason.1.Code       : chr "R"
 $ Record.1.Reason.1.Description: chr "local"
 $ Record.1.Score               : chr "xxx"
 $ Record.2.Date                : chr "2017-05-26T00:00:00"
 $ Record.2.Grade               : num 2
 $ Record.2.Reason.1.Code       : chr "R"
 $ Record.2.Reason.1.Description: chr "NA"
 $ Record.2.Score               : chr "xyx"
 $ Email.1                      : chr "NA"
 $ FirstName                    : chr "MM"
 $ Inquiry.1.DateOfInquiry      : POSIXct, format: "2017-03-19"
 $ Inquiry.1.Reason             : chr "cc"
 $ Inquiry.2.DateOfInquiry      : POSIXct, format: "2016-10-14"
 $ Inquiry.2.Reason             : chr "er"
 $ NumberOfInquiries            : num 2

,但正如我之前提到的,子列表也可以在单独的表中.

, but as I mentioned previously, sub lists could also be in separate tables.

推荐答案

L=xmlToList(x)
str(data.frame(t(unlist(L)), stringsAsFactors=FALSE))
# 'data.frame': 1 obs. of  15 variables:
#  $ CIP.RecordList.Record.Date                          : chr "2017-05-26T00:00:00"
#  $ CIP.RecordList.Record.Grade                         : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code       : chr "R"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Description: chr "local"
#  $ CIP.RecordList.Record.Score                         : chr "xxx"
#  $ CIP.RecordList.Record.Date.1                        : chr "2017-04-30T00:00:00"
#  $ CIP.RecordList.Record.Grade.1                       : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code.1     : chr "R"
#  $ CIP.RecordList.Record.Score.1                       : chr "xyx"
#  $ Individual.General.FirstName                        : chr "MM"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry         : chr "2017-03-19"
#  $ Inquiries.InquiryList.Inquiry.Reason                : chr "cc"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry.1       : chr "2016-10-14"
#  $ Inquiries.InquiryList.Inquiry.Reason.1              : chr "er"
#  $ Inquiries.Summary.NumberOfInquiries                 : chr "2"

如果要将具有合适表示的字符串转换为数字,假设 df 是上面的数据框:

If you want to convert strings that have a suitable representation as numbers, assuming that df is the data frame above:

data.frame(t(lapply(df, function(x) 
               ifelse(is.na(y<-suppressWarnings(as.numeric(x))), x, y))))

没有数字表示的字符串将不会被转换.

Strings that do not have a number representation will not be converted.

动机

A) 在一些评论中,OP 添加了对执行速度的进一步要求,这对于诸如数据导入之类的一次性任务通常不是问题.正如问题中明确要求的那样,上述解决方案基于递归.当然,上下遍历节点会增加很多开销.

A) In some comments the OP added a further request for execution speed, which is normally not a issue for one time tasks such as data import. The solution above is based on recursion, as explicitly required in the question. Of course, traversing up and down the nodes adds a lot of overhead.

B)这里最近的一个答案提出了一种基于外部工具集合的复杂方法.当然可能有不同的好工具来管理 XML 文件,但恕我直言,大部分 XPATH 工作都可以在 R 本身中轻松高效地完成.

B) One recent answer here proposes a complex method based on a collection of external tools. There might of course be different nice utilities to manage XML files, but IMHO much of the XPATH work can be comfortably and efficiently done in R itself.

C) OP 想知道是否可以为 XML 的每个列表对象创建单独的 data.frames".

C) The OP wonders if it is possible to "create separate data.frames for each list object of XML".

D) 我注意到在问题标签中,OP(似乎)需要更新的 xml2 包.

D) I noticed that in the question tags, the OP (seems to) require the newer xml2 package.

我直接使用 R 中的 XPATH 来解决上述问题.

I address the points above using XPATH straight from R.

XPATH 方法

下面我在一个单独的数据框中提取 Record 节点.也可以对其他(子)节点使用相同的方法.

Below I extract in a separate data frame the Record node. One can use the same approach for other (sub)nodes too.

library(xml2)
xx=read_xml(x)                                                                              
xx=(xml_find_all(xx, "//Record"))
system.time(
    xx <- xml_find_all(xx, ".//descendant::*[not(*)]"))
#  user  system elapsed 
# 38.00    0.36   38.35 
system.time(xx <- xml_text(xx))
#  user  system elapsed 
# 68.39    0.05   68.53 
head(data.frame(t(matrix(xx, 5))))
#                    X1 X2 X3    X4  X5
# 1 2017-05-26T00:00:00  2  R local xxx
# 2 2017-04-30T00:00:00  2  R       xyx
# 3 2017-05-26T00:00:00  2  R local xxx
# 4 2017-04-30T00:00:00  2  R       xyx
# 5 2017-05-26T00:00:00  2  R local xxx
# 6 2017-04-30T00:00:00  2  R       xyx

(您可能需要添加更多代码来命名数据框列)

(You might want to add further code to name data frame columns)

时间指的是我的平均笔记本电脑.

Time is referred to my average laptop.

说明

解决方案的核心在于XPATH .//descendant::*[not(*)]..//descendant:: 提取当前上下文的所有后代(Record 节点);添加 [not(*)] 进一步扁平化布局.这允许线性化树结构,使其更适合数据科学建模.

The core of the solutions lies in the XPATH .//descendant::*[not(*)]. .//descendant:: extracts all descendants of the current context (the Record node); adding [not(*)] further flattens the layout. This allows to linearize a tree structure, making it more for suitable for data science modeling.

* 的灵活性是以计算为代价的.然而,计算负担并不在于 R,它是一种解释型语言,而是以高效的外部 C 库 libxml2.结果应该与其他实用程序和库的结果相同或更好.

The flexibility of * comes at a price in terms of computation. However, the computational burden does no lie on R, which is an interpreted language, but comes at the expenses of the highly efficient external C library libxml2. Results should be equal or better than those of other utilities and libraries.

这篇关于将大型复杂的 XML 文件解析为 data.frame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆