Parse XML with getNodeSet - Identify missing tags


Question

I am parsing an XML file with getNodeSet(). Assume I have an XML file from a bookstore with 4 different books listed, but for one book the tag "authors" is missing.

If I query the XML for the tag "authors" with data.nodes.2 <- getNodeSet(data,'//*/authors'), R returns a list of 3 elements.

However, this is not exactly what I want. How do I get getNodeSet() to return a list with 4 elements instead of 3, i.e. with a missing value in the position of the book where the tag "authors" does not exist?

Thanks for your help.

library(XML)

file <- "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n<!-- Edited by XMLSpy® -->\r\n<bookstore>\r\n<book category=\"cooking\">\r\n<title lang=\"en\">Everyday Italian</title>\r\n<authors>\r\n<author>Giada De Laurentiis</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>30.00</price>\r\n</book>\r\n<book category=\"children\">\r\n<title lang=\"en\">Harry Potter</title>\r\n<authors>\r\n<author>J K. Rowling</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>29.99</price>\r\n</book>\r\n<book category=\"web\">\r\n<title lang=\"en\">XQuery Kick Start</title>\r\n<authors>\r\n<author>James McGovern</author>\r\n<author>Per Bothner</author>\r\n<author>Kurt Cagle</author>\r\n<author>James Linn</author>\r\n<author>Vaidyanathan Nagarajan</author>\r\n</authors>\r\n<year>2003</year>\r\n<price>49.99</price>\r\n</book>\r\n<book category=\"web\" cover=\"paperback\">\r\n<title lang=\"en\">Learning XML</title>\r\n\r\n<year>2003</year>\r\n<price>39.95</price>\r\n</book>\r\n</bookstore>"

data <- xmlParse(file)

data.nodes.1 <- getNodeSet(data,'//*/book')

data.nodes.2 <- getNodeSet(data,'//*/authors')


# Data

# <?xml version="1.0" encoding="ISO-8859-1"?>
# <!-- Edited by XMLSpy® -->
# <bookstore>
#   <book category="cooking">
#     <title lang="en">Everyday Italian</title>
#     <authors>
#       <author>Giada De Laurentiis</author>
#     </authors>
#     <year>2005</year>
#     <price>30.00</price>
#   </book>
#   <book category="children">
#     <title lang="en">Harry Potter</title>
#     <authors>
#       <author>J K. Rowling</author>
#     </authors>
#     <year>2005</year>
#     <price>29.99</price>
#   </book>
#   <book category="web">
#     <title lang="en">XQuery Kick Start</title>
#     <authors>
#       <author>James McGovern</author>
#       <author>Per Bothner</author>
#       <author>Kurt Cagle</author>
#       <author>James Linn</author>
#       <author>Vaidyanathan Nagarajan</author>
#     </authors>
#     <year>2003</year>
#     <price>49.99</price>
#   </book>
#   <book category="web" cover="paperback">
#     <title lang="en">Learning XML</title>
#     <year>2003</year>
#     <price>39.95</price>
#   </book>
# </bookstore>

Recommended answer

One option is to use R's list processing to extract the authors from each book node:

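# 'data' is the parsed document from the question
books <- getNodeSet(data, "//book")
authors <- lapply(books, xpathSApply, ".//author", xmlValue)
# a book without any <author> yields an empty list; mark it as NA
authors[sapply(authors, is.list)] <- NA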
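The is.list() test works because xpathSApply() hands back a plain character vector when it finds authors and an empty list when it finds none. A minimal sketch of that behaviour, assuming a shortened two-book document (the names xml_text and doc are made up for this illustration and are not part of the original post):

```r
library(XML)

# Hypothetical, shortened document: the second book has no <authors> tag.
xml_text <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title>",
  "<authors><author>Giada De Laurentiis</author></authors></book>",
  "<book><title>Learning XML</title></book>",
  "</bookstore>"
)
doc <- xmlParse(xml_text, asText = TRUE)

books <- getNodeSet(doc, "//book")
authors <- lapply(books, xpathSApply, ".//author", xmlValue)

# TRUE marks the books for which no <author> node was found.
no_author <- sapply(authors, is.list)
print(no_author)

# Replace the empty results by NA so every book keeps its position.
authors[no_author] <- NA
print(length(authors) == length(books))
```

Because the lookup runs once per book rather than once over the whole document, the result always has one entry per book, whether or not the tag exists.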

The author list can then be combined with book-level information, here the title:

title <- sapply(books, xpathSApply, "string(.//title/text())")

Building a data frame from the two gives:

>     data.frame(Title=rep(title, sapply(authors, length)),
+                Author=unlist(authors))
              Title                 Author
1  Everyday Italian    Giada De Laurentiis
2      Harry Potter           J K. Rowling
3 XQuery Kick Start         James McGovern
4 XQuery Kick Start            Per Bothner
5 XQuery Kick Start             Kurt Cagle
6 XQuery Kick Start             James Linn
7 XQuery Kick Start Vaidyanathan Nagarajan
8      Learning XML                   <NA>
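If the goal is literally a list with one element per book, as asked in the question, the same per-book idea can be applied to the "authors" nodes themselves. The following is only a sketch under assumptions: it uses a shortened two-book document, and the names xml_text, doc, authors_nodes and has_authors are invented for the illustration.

```r
library(XML)

# Hypothetical, shortened document: the second book has no <authors> tag.
xml_text <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title>",
  "<authors><author>Giada De Laurentiis</author></authors></book>",
  "<book><title>Learning XML</title></book>",
  "</bookstore>"
)
doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# One entry per book: the <authors> node if it exists, otherwise NA.
authors_nodes <- lapply(books, function(book) {
  hits <- getNodeSet(book, "./authors")  # path is relative to this book
  if (length(hits) == 0) NA else hits[[1]]
})

# Flag which books actually carry an <authors> tag.
has_authors <- !sapply(authors_nodes, function(x) identical(x, NA))
print(has_authors)
```

The relative path "./authors" matters here: a path starting with "//" would search the whole document again instead of only the current book.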
