Storing specific XML node values with R's xmlEventParse
Question
I have a big XML file that I need to parse with xmlEventParse in R. Unfortunately, the online examples are more complex than I need. I just want to flag a matching node tag and store the matched node's text (not its attributes), with the text for each tag in a separate list; see the comments in the code below:
library(XML)
z <- xmlEventParse(
    "my.xml",
    handlers = list(
        startDocument = function()
        {
            cat("Starting document\n")
        },
        startElement = function(name, attr)
        {
            if ( name == "myNodeToMatch1" ){
                cat("FLAG Matched element 1\n")
            }
            if ( name == "myNodeToMatch2" ){
                cat("FLAG Matched element 2\n")
            }
        },
        text = function(text) {
            if ( # Matched element 1 .... )
                # Store text in element 1 list
            if ( # Matched element 2 .... )
                # Store text in element 2 list
        },
        endDocument = function()
        {
            cat("ending document\n")
        }
    ),
    addContext = FALSE,
    useTagName = FALSE,
    ignoreBlanks = TRUE,
    trim = TRUE)
z$ ... # show lists ??
My question is: how do I implement this flag in R (in a professional way :)? Plus: what is the best way to match N arbitrary nodes (if name == "myNodeToMatchN" ...) while avoiding case-by-case matching of each node?
my.xml could be just a simple XML file like
<A>
  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>
  <B>
    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>
    ...
  </B>
</A>
Recommended answer
I'll use fileName from example(xmlEventParse) as a reproducible example. It contains record tags, each with an id attribute and text that we'd like to extract. Rather than use the handlers argument, I'll go after the branches argument. A branch function is like a handler, but it has access to the full node rather than just the element. The idea is to write a closure that has a place to keep the data we accumulate, plus a function to process each branch of the XML document we are interested in. So let's start by defining the closure: for our purposes, a function that returns a list of functions
ourBranches <- function() {
We need a place to store the results we accumulate. We choose an environment so that insertion time is constant (rather than a list, which we would have to append to and which would be memory inefficient)
    store <- new.env()
The event parser expects a list of functions to be invoked when a matching tag is discovered. We're interested in the record tag. The function we write will receive a node of the XML document. We want to extract the id attribute, which we'll use as the key under which to store the (text) value of the node. We add these to our store.
    record <- function(x, ...) {
        key <- xmlAttrs(x)[["id"]]
        value <- xmlValue(x)
        store[[key]] <- value
    }
Once the document is processed, we'd like a convenient way to retrieve our results, so we add a function for our own purposes, independent of the nodes in the document
    getStore <- function() as.list(store)
and then finish the closure by returning a list of functions
    list(record=record, getStore=getStore)
}
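Since store is created inside ourBranches(), every call to ourBranches() should get its own private store. As a rough illustration of that pattern with no XML involved, here is a minimal sketch in base R. The names makeStore and put are hypothetical and exist only for this example.

```r
# Hypothetical helper, not part of the XML package: the same closure pattern
# as ourBranches(), reduced to a plain key/value store.
makeStore <- function() {
    store <- new.env()  # created afresh on every call to makeStore()
    put <- function(key, value) assign(key, value, envir = store)
    getStore <- function() as.list(store)
    list(put = put, getStore = getStore)
}

a <- makeStore()
b <- makeStore()
a$put("x", 1)
b$put("y", 2)

names(a$getStore())  # "x"
names(b$getStore())  # "y"
```

The two stores do not see each other's keys, which is why a fresh call to ourBranches() is enough to start with an empty set of results.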
A tricky concept here is that the environment in which a function is defined is part of the function, so each time we call ourBranches() we get a list of functions and a new environment, store, to keep our results. To use it, invoke xmlEventParse on our file with an empty set of event handlers, then access our accumulated store.
> branches <- ourBranches()
> xmlEventParse(fileName, list(), branches=branches)
list()
> head(branches$getStore(), 2)
$`Hornet Sportabout`
[1] "18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 "
$`Toyota Corolla`
[1] "33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 "
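The question also asks how to match N arbitrary tags without writing one test per tag. One possible way, sketched here under assumptions, is to build the branches list programmatically from a vector of tag names. The helper collectTags and its inner makeBranch are hypothetical names for this illustration, and the temporary file merely imitates the my.xml from the question. Unlike the store above, values are appended to a vector per tag, which is simple but not the most efficient choice for very large files.

```r
library(XML)

# Hypothetical helper (not part of the XML package): builds one branch
# function per tag name, all sharing a single environment as the store.
collectTags <- function(tags) {
    store <- new.env()
    makeBranch <- function(tag) {
        force(tag)  # fix the tag name for this particular branch function
        function(node, ...) {
            # Append this node's text to the vector kept under its tag name.
            assign(tag, c(store[[tag]], xmlValue(node)), envir = store)
        }
    }
    branches <- setNames(lapply(tags, makeBranch), tags)
    list(branches = branches, getStore = function() as.list(store))
}

# A small file shaped like the my.xml in the question.
xmlFile <- tempfile(fileext = ".xml")
writeLines(c(
    "<A>",
    "  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
    "  <B>",
    "    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>",
    "  </B>",
    "</A>"
), xmlFile)

collector <- collectTags(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xmlFile, handlers = list(), branches = collector$branches))
result <- collector$getStore()
```

After the parse, result should be a named list with one character vector per requested tag, so adding another tag only means adding its name to the vector passed to collectTags().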