将大而复杂的XML文件解析为data.frame [英] Parsing large and complicated XML file to data.frame

查看:80
本文介绍了将大而复杂的XML文件解析为data.frame的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

因此,我有一个包含大量报告的大型XML文件.我在下面创建了数据示例,以大致显示xml的大小及其结构:

So, I have large XML file with lots of reports. I created data example below to approximately show the size of xml and its structure:

x <- "<Report><Agreements><AgreementList /></Agreements><CIP><RecordList><Record><Date>2017-05-26T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description>local</Description></Reason></ReasonsList><Score>xxx</Score></Record><Record><Date>2017-04-30T00:00:00</Date><Grade>2</Grade><ReasonsList><Reason><Code>R</Code><Description/></Reason></ReasonsList><Score>xyx</Score></Record></RecordList></CIP><Individual><Contact><Email/></Contact><General><FirstName>MM</FirstName></General></Individual><Inquiries><InquiryList><Inquiry><DateOfInquiry>2017-03-19</DateOfInquiry><Reason>cc</Reason></Inquiry><Inquiry><DateOfInquiry>2016-10-14</DateOfInquiry><Reason>er</Reason></Inquiry></InquiryList><Summary><NumberOfInquiries>2</NumberOfInquiries></Summary></Inquiries></Report>"

x <- paste(rep(x, 1.5e+5), collapse = "")
x <- paste0("<R>", x, "</R>")
require(XML)
p <- xmlParse(x)
p <- xmlRoot(p)
p[[1]]

我想将此数据转换为data.frame,但是XML的结构并不简单.以前使用XML时,我创建了一个循环,每个报告都将其子节点转换为data.frame,但在此数据中,子节点数大于30(示例中未全部包含),并且结构有所不同(列表节点在XML中甚至可以出现2个层次).

I would like to transform this data to data.frame, but the structure of XML isn't straightforward. Previously working with XMLs I created loop that for every report transforms its sub nodes to data.frame, but here (in this data) the sub node count is greater than 30 (didn't put all of them in the example), and the structure differs (List nodes can occur even 2 levels deep in XML).

所以我有几个问题:

1)我确信,循环报表不是解决此问题的最佳方法.我应该如何解决这个问题?

1) I am sure that looping over reports isn't the best way to handle this. How should I approach this problem?

2)我可以以某种方式(递归地)提取一个报表两行一行的所有数据吗?

2) Can I somehow extract all the data of one report two one line of data.frame (recursively maybe)?

3)还是可以为XML的每个列表对象自动创建单独的data.frame?

3) Or can I automatically create separate data.frames for each list object of XML?

任何帮助将不胜感激.

结果示例如下:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1 obs. of  17 variables:
 $ Record.1.Date                : chr "2017-05-26T00:00:00"
 $ Record.1.Grade               : num 2
 $ Record.1.Reason.1.Code       : chr "R"
 $ Record.1.Reason.1.Description: chr "local"
 $ Record.1.Score               : chr "xxx"
 $ Record.2.Date                : chr "2017-05-26T00:00:00"
 $ Record.2.Grade               : num 2
 $ Record.2.Reason.1.Code       : chr "R"
 $ Record.2.Reason.1.Description: chr "NA"
 $ Record.2.Score               : chr "xyx"
 $ Email.1                      : chr "NA"
 $ FirstName                    : chr "MM"
 $ Inquiry.1.DateOfInquiry      : POSIXct, format: "2017-03-19"
 $ Inquiry.1.Reason             : chr "cc"
 $ Inquiry.2.DateOfInquiry      : POSIXct, format: "2016-10-14"
 $ Inquiry.2.Reason             : chr "er"
 $ NumberOfInquiries            : num 2

,但是正如我之前提到的,子列表也可以位于单独的表中.

, but as I mentioned previously, sub lists could also be in separate tables.

推荐答案

L=xmlToList(x)
str(data.frame(t(unlist(L)), stringsAsFactors=FALSE))
# 'data.frame': 1 obs. of  15 variables:
#  $ CIP.RecordList.Record.Date                          : chr "2017-05-26T00:00:00"
#  $ CIP.RecordList.Record.Grade                         : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code       : chr "R"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Description: chr "local"
#  $ CIP.RecordList.Record.Score                         : chr "xxx"
#  $ CIP.RecordList.Record.Date.1                        : chr "2017-04-30T00:00:00"
#  $ CIP.RecordList.Record.Grade.1                       : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code.1     : chr "R"
#  $ CIP.RecordList.Record.Score.1                       : chr "xyx"
#  $ Individual.General.FirstName                        : chr "MM"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry         : chr "2017-03-19"
#  $ Inquiries.InquiryList.Inquiry.Reason                : chr "cc"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry.1       : chr "2016-10-14"
#  $ Inquiries.InquiryList.Inquiry.Reason.1              : chr "er"
#  $ Inquiries.Summary.NumberOfInquiries                 : chr "2"

如果要转换具有适当表示形式的数字字符串,请假定df是上面的数据框:

If you want to convert strings that have a suitable representation as numbers, assuming that df is the data frame above:

data.frame(t(lapply(df, function(x) 
               ifelse(is.na(y<-suppressWarnings(as.numeric(x))), x, y))))

没有数字表示的字符串将不会转换.

Strings that do not have a number representation will not be converted.

动机

A)在某些评论中,OP增加了对执行速度的进一步要求,对于诸如数据导入之类的一次性任务,这通常不是问题.上面的解决方案基于递归,如问题中明确要求的那样.当然,在节点上上下移动会增加很多开销.

A) In some comments the OP added a further request for execution speed, which is normally not a issue for one time tasks such as data import. The solution above is based on recursion, as explicitly required in the question. Of course, traversing up and down the nodes adds a lot of overhead.

B)这里最近的一个答案提出了一种基于一组外部工具的复杂方法.当然,可能会有其他不错的实用程序来管理XML文件,但是恕我直言,许多XPATH工作都可以在R本身中轻松有效地完成.

B) One recent answer here proposes a complex method based on a collection of external tools. There might of course be different nice utilities to manage XML files, but IMHO much of the XPATH work can be comfortably and efficiently done in R itself.

C)OP怀疑是否可以为XML的每个列表对象创建单独的data.frames".

C) The OP wonders if it is possible to "create separate data.frames for each list object of XML".

D)我注意到在问号中,OP(似乎)需要更新的xml2软件包.

D) I noticed that in the question tags, the OP (seems to) require the newer xml2 package.

我直接从R使用XPATH解决了以上几点.

I address the points above using XPATH straight from R.

XPATH方法

下面,我在单独的数据框中提取Record节点.一个也可以对其他(子)节点使用相同的方法.

Below I extract in a separate data frame the Record node. One can use the same approach for other (sub)nodes too.

library(xml2)
xx=read_xml(x)                                                                              
xx=(xml_find_all(xx, "//Record"))
system.time(
    xx <- xml_find_all(xx, ".//descendant::*[not(*)]"))
#  user  system elapsed 
# 38.00    0.36   38.35 
system.time(xx <- xml_text(xx))
#  user  system elapsed 
# 68.39    0.05   68.53 
head(data.frame(t(matrix(xx, 5))))
#                    X1 X2 X3    X4  X5
# 1 2017-05-26T00:00:00  2  R local xxx
# 2 2017-04-30T00:00:00  2  R       xyx
# 3 2017-05-26T00:00:00  2  R local xxx
# 4 2017-04-30T00:00:00  2  R       xyx
# 5 2017-05-26T00:00:00  2  R local xxx
# 6 2017-04-30T00:00:00  2  R       xyx

(您可能想添加更多代码来命名数据框列)

(You might want to add further code to name data frame columns)

时间是指我的平均笔记本电脑时间.

Time is referred to my average laptop.

说明

解决方案的核心在于XPATH .//descendant::*[not(*)]. .//descendant::提取当前上下文的所有后代(Record节点);添加[not(*)]进一步使布局平坦.这样可以线性化树结构,使其更适合于数据科学建模.

The core of the solutions lies in the XPATH .//descendant::*[not(*)]. .//descendant:: extracts all descendants of the current context (the Record node); adding [not(*)] further flattens the layout. This allows to linearize a tree structure, making it more for suitable for data science modeling.

*的灵活性是以计算为代价的.但是,计算负担并不在于R(一种解释型语言),而是以高效的外部C库 libxml2 .结果应该等于或优于其他实用程序和库.

The flexibility of * comes at a price in terms of computation. However, the computational burden does no lie on R, which is an interpreted language, but comes at the expenses of the highly efficient external C library libxml2. Results should be equal or better than those of other utilities and libraries.

这篇关于将大而复杂的XML文件解析为data.frame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆