XML data extraction where not all parent nodes contain the child node

Problem description

I have an XML data file in which each user has opened an account and, in some cases, the account has since been terminated. When an account has not been terminated, the termination element is simply absent rather than empty, which makes it difficult to extract the information into a table.

Here is a reproducible example (only users 1 and 3 have had their accounts terminated):

library(XML)
my_xml <- xmlParse('<accounts>
                    <user>
                      <id>1</id>
                      <start>2015-01-01</start>
                      <termination>2015-01-21</termination>
                    </user>
                    <user>
                      <id>2</id>
                      <start>2015-01-01</start>
                    </user>
                    <user>
                      <id>3</id>
                      <start>2015-02-01</start>
                      <termination>2015-04-21</termination>
                    </user>
                    <user>
                      <id>4</id>
                      <start>2015-03-01</start>
                    </user>
                    <user>
                      <id>5</id>
                      <start>2015-04-01</start>
                    </user>
                    </accounts>')

To create a data.frame I tried using sapply. However, the XPath query for termination returns only the nodes that exist and yields no NA for users without a termination value, so the columns end up with different lengths and the code fails with the error: arguments imply differing number of rows: 5, 2

accounts <- data.frame(id=sapply(my_xml["//user//id"], xmlValue),
                       start=sapply(my_xml["//user//start"], xmlValue),
                       termination=sapply(my_xml["//user//termination"], xmlValue)
                       )
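
To see where the 5 and the 2 in the error message come from, it can help to count the nodes each XPath expression matches. The following is a minimal sketch, assuming the same XML package as above; the compact one-line-per-user copy of the data and the names n_id and n_term are introduced here purely for illustration.

```r
library(XML)

# Compact copy of the sample data: only users 1 and 3 have a <termination> child.
my_xml <- xmlParse('<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
  <user><id>3</id><start>2015-02-01</start><termination>2015-04-21</termination></user>
  <user><id>4</id><start>2015-03-01</start></user>
  <user><id>5</id><start>2015-04-01</start></user>
</accounts>')

# Each query runs against the whole document, so it returns only the nodes
# that exist; nothing marks the users that lack the child element.
n_id   <- length(my_xml["//user//id"])
n_term <- length(my_xml["//user//termination"])

c(id = n_id, termination = n_term)
```

The two counts differ, and data.frame cannot combine columns of unequal length, which is why the query has to be run once per user rather than once per document.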

Any suggestions on how to solve this problem?

Recommended answer

I prefer the xml2 package to the XML package because I find its syntax easier to use. The problem is straightforward: find all of the user nodes, then extract the id and termination nodes from each one. With xml2, xml_find_first returns a missing node when nothing matches, and xml_text converts that missing node to NA, so every user yields exactly one value.

library(xml2)
my_xml <- read_xml('<accounts>
                   <user>
                   <id>1</id>
                   <start>2015-01-01</start>
                   <termination>2015-01-21</termination>
                   </user>
                   <user>
                   <id>2</id>
                   <start>2015-01-01</start>
                   </user>
                   <user>
                   <id>3</id>
                   <start>2015-02-01</start>
                   <termination>2015-04-21</termination>
                   </user>
                   <user>
                   <id>4</id>
                   <start>2015-03-01</start>
                   </user>
                   <user>
                   <id>5</id>
                   <start>2015-04-01</start>
                   </user>
                   </accounts>')

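# Select every <user> node; each becomes one row of the result.
usernodes <- xml_find_all(my_xml, ".//user")

# Query each user separately so that a missing child yields NA
# instead of being silently dropped.
ids <- sapply(usernodes, function(n) xml_text(xml_find_first(n, ".//id")))
terms <- sapply(usernodes, function(n) xml_text(xml_find_first(n, ".//termination")))

# Both vectors have one entry per user, so they combine cleanly.
answer <- data.frame(ids, terms)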
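
As a possible variation, xml_find_first also accepts a whole node set and returns one result per input node, so the per-node sapply calls can be dropped. The sketch below makes that assumption explicit and additionally extracts the start column and converts both date columns with as.Date; the compact copy of the data and the names users and accounts are illustrative choices, not part of the original answer.

```r
library(xml2)

# Compact copy of the sample data: only users 1 and 3 have a <termination> child.
my_xml <- read_xml('<accounts>
  <user><id>1</id><start>2015-01-01</start><termination>2015-01-21</termination></user>
  <user><id>2</id><start>2015-01-01</start></user>
  <user><id>3</id><start>2015-02-01</start><termination>2015-04-21</termination></user>
  <user><id>4</id><start>2015-03-01</start></user>
  <user><id>5</id><start>2015-04-01</start></user>
</accounts>')

users <- xml_find_all(my_xml, ".//user")

# xml_find_first() applied to a node set returns one entry per user;
# users without the child get a missing node, which xml_text() maps to NA.
accounts <- data.frame(
  id          = xml_text(xml_find_first(users, "./id")),
  start       = as.Date(xml_text(xml_find_first(users, "./start"))),
  termination = as.Date(xml_text(xml_find_first(users, "./termination"))),
  stringsAsFactors = FALSE
)

accounts
```

Storing the dates as Date values rather than text keeps the NA entries and makes later steps such as computing account duration (termination - start) straightforward.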
