Storing specific XML node values with R's xmlEventParse


Question

I have a large XML file that I need to parse with xmlEventParse in R. Unfortunately, the online examples are more complex than I need. I just want to flag a matching node tag and store the matched node's text (not its attributes), with the text for each tag in a separate list. See the comments in the code below:

library(XML)
z <- xmlEventParse(
    "my.xml", 
    handlers = list(
        startDocument   =   function() 
        {
                cat("Starting document\n")
        },  
        startElement    =   function(name,attr) 
        {
                if ( name == "myNodeToMatch1" ){
                    cat("FLAG Matched element 1\n")
                }
                if ( name == "myNodeToMatch2" ){
                    cat("FLAG Matched element 2\n")
                }
        },
        text            =   function(text) {
                # if (<matched element 1>) store text in element 1 list
                # if (<matched element 2>) store text in element 2 list
        },
        endDocument     =   function() 
        {
                cat("ending document\n")
        }
    ),
    addContext = FALSE,
    useTagName = FALSE,
    ignoreBlanks = TRUE,
    trim = TRUE)
# z$ ...   # show lists ??

My question is: how do I implement this flag in R (in a professional way :)? Also, what is the best way to match N arbitrary node names (if name == "myNodeToMatchN" ...) without writing a separate case for each node?

my.xml could be just a simple XML file like

<A>
  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>
  <B>
    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>
    ...
  </B>
</A>

Recommended answer

I'll use fileName from example(xmlEventParse) as a reproducible example. It has record tags, each with an id attribute and text that we'd like to extract. Rather than use handlers, I'll use the branches argument. A branch function is like a handler, but it has access to the full node rather than just the element. The idea is to write a closure that has a place to keep the data we accumulate, and a function to process each branch of the XML document we are interested in. So let's start by defining the closure: for our purposes, a function that returns a list of functions

ourBranches <- function() {

We need a place to store the results we accumulate. We choose an environment so that insertion time is constant (rather than a list, which we would have to append to and which would be memory inefficient)

    store <- new.env() 

The event parser expects a list of functions to be invoked when a matching tag is discovered. We're interested in the record tag. The function we write will receive a node of the XML document. We extract the node's id attribute and use it as the key under which we store the node's (text) value in our store.

    record <- function(x, ...) {
        key <- xmlAttrs(x)[["id"]]
        value <- xmlValue(x)
        store[[key]] <- value
    }

Once the document is processed, we'd like a convenient way to retrieve our results, so we add a function for our own purposes, independent of the nodes in the document

    getStore <- function() as.list(store)

and then finish the closure by returning a list of functions

    list(record=record, getStore=getStore)
}
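As a stand-alone illustration of the closure pattern used in ourBranches, here is a minimal base-R sketch. The names makeAccumulator, add, a and b are hypothetical and not part of the XML package; the point is only that every call to the outer function creates a fresh private environment shared by the functions it returns.

```r
# Hypothetical helper, for illustration only (not part of the XML package).
makeAccumulator <- function() {
    store <- new.env()                    # created anew on every call

    # Insert or overwrite one key; the environment is modified in place.
    add <- function(key, value) {
        store[[key]] <- value
        invisible(NULL)
    }

    # Convert the environment to an ordinary list for inspection.
    getStore <- function() as.list(store)

    list(add = add, getStore = getStore)
}

a <- makeAccumulator()
b <- makeAccumulator()                    # independent of 'a'

a$add("x", 1)
a$add("y", 2)
b$add("x", 10)

sort(names(a$getStore()))                 # "x" "y"
b$getStore()$x                            # 10
```

Because as.list() on an environment does not guarantee any particular order, the sketch sorts the names before showing them.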

A tricky concept here is that the environment in which a function is defined is part of the function, so each time we call ourBranches() we get a list of functions and a new environment, store, to keep our results. To use it, invoke xmlEventParse on our file with an empty set of event handlers, then access our accumulated store.

> branches <- ourBranches()
> xmlEventParse(fileName, list(), branches=branches)
list()
> head(branches$getStore(), 2)
$`Hornet Sportabout`
[1] "18.7   8 360.0 175 3.15 3.440 17.02  0  0    3 "

$`Toyota Corolla`
[1] "33.9   4  71.1  65 4.22 1.835 19.90  1  1    4 "
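
The same idea might be adapted to the XML in the question along the following lines. This is only a sketch under stated assumptions: collectNodes, makeBranch and the temporary file are hypothetical names introduced for illustration, the sample document is the one from the question with the elided part left out, and the text for each tag is collected in a character vector rather than keyed by an id attribute. Building the branch list from a vector of tag names avoids writing a separate if for each of the N nodes.

```r
library(XML)

# Sample document from the question, written to a temporary file.
xmlFile <- tempfile(fileext = ".xml")
writeLines(c(
    "<A>",
    "  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
    "  <B>",
    "    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>",
    "  </B>",
    "</A>"), xmlFile)

# Hypothetical helper: one branch function per tag name, all sharing one store.
collectNodes <- function(tags) {
    store <- new.env()

    makeBranch <- function(tag) {
        force(tag)                        # fix the tag name for this branch
        function(x, ...) {
            # Append this node's text to the vector kept for its tag.
            store[[tag]] <- c(store[[tag]], xmlValue(x))
        }
    }

    branches <- setNames(lapply(tags, makeBranch), tags)
    list(branches = branches, getStore = function() as.list(store))
}

collector <- collectNodes(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xmlFile, handlers = list(),
                        branches = collector$branches))

result <- collector$getStore()
result$myNodeToMatch1
result$myNodeToMatch2
```

Keeping getStore outside the list passed as branches means only real tag names are registered with the parser.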
