Storing XML node values with R's xmlEventParse for filtered output


Question

I have a huge XML file (260 MB) containing a lot of information that looks like this:

Example:

<mydocument>
<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">
<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21.243" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
</POSITIONS>
</mydocument>

There are around 40 different FrameSet nodes, each with a different GameSection="..." and Object="...".

I would love to extract the information in the <Frame> nodes into a list object, but I cannot load the whole XML file because it is too large. Is there any way I can use the xmlEventParse function to filter for a specific GameSection and a specific Object, and get all the information from the corresponding <Frame> elements?

Recommended answer

It might be that the 'internal' representation is not that large:

library(XML)
xml <- xmlTreeParse("file.xml", useInternalNodes=TRUE)

If that parses, XPath will definitely be your best bet. If that doesn't work, you'll need to get your head around closures. I'm going to aim for the branches argument of xmlEventParse, which allows hybrid parsing: event parsing iterates through the file, coupled with DOM parsing on each matching node. Here's a function that returns a list of functions.

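branchFactory <- function() {
    # Private store for the results; emptyenv() as parent is a safety measure.
    env <- new.env(parent=emptyenv())

    # Branch handler: called with each complete FrameSet node.
    FrameSet <- function(elt) {
        # Key built from all FrameSet attributes; it must be unique per node.
        id <- paste(xmlAttrs(elt), collapse=":")
        # One column per Frame, one row per attribute.
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }

    # Accessor for the accumulated results.
    get <- function() env

    list(get=get, FrameSet=FrameSet)
}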

Inside this function we create a place to store our results as we iterate through the file. This could be a list, but an environment is better: it lets us insert new results without copying all the results that we've already inserted. So here's our environment:

    env <- new.env(parent=emptyenv())

We use the parent argument as a safety measure, even though it's not relevant in our present case. Now we define a function that will be invoked whenever a "FrameSet" node is encountered:

    FrameSet <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }

It turns out that, when we use the branches argument, xmlEventParse arranges to parse the entire node into an object that we can manipulate via the DOM, e.g., using xmlAttrs and xpathSApply. The first line of this function creates a unique identifier for this frame set (maybe that's not the case for the full data set; you'll need a unique identifier). We then parse the "//Frame" part of the element and store that in our environment. Storing the result is trickier than it looks: we're assigning into a variable called env. env doesn't exist in the body of the FrameSet function, so R uses its lexical scoping rules to search for a variable named env in the environment in which the FrameSet function was defined. And lo, it finds the env that we have already created. This is where we add the result of xpathSApply. That's it for our FrameSet node parser.
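The lexical-scoping lookup described above can be tried in isolation, without any XML. The following is only a minimal base-R sketch; makeStore and add are hypothetical names invented for this illustration and are not part of the XML package or of the answer's code.

```r
# Hypothetical factory used only to illustrate lexical scoping.
makeStore <- function() {
    env <- new.env(parent = emptyenv())

    # `env` is not defined in the body of add(); R finds it in the
    # enclosing environment, i.e. the one created by this makeStore() call.
    add <- function(key, value) {
        env[[key]] <- value      # modifies the environment in place
        invisible(NULL)
    }

    get <- function() env
    list(add = add, get = get)
}

store <- makeStore()
store$add("a", 1:3)
store$add("b", letters[1:2])
stored <- sort(ls(store$get()))   # names of the stored results
```

Here add plays the role of the FrameSet handler: each call inserts one more entry into the shared environment, and nothing that was stored earlier is copied.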

We'd also like a convenience function that we can use to retrieve env, like this:

    get <- function() env

Again, this uses lexical scoping to find the env variable created at the top of branchFactory. We end branchFactory by returning a list of the functions that we've defined:

    list(get=get, FrameSet=FrameSet)

This too is surprisingly tricky: we're returning a list of functions. The functions are defined in the environment created when we invoke branchFactory and, for lexical scoping to work, that environment has to persist. So we're actually returning not only the list of functions but also, implicitly, the variable env.
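To see that the environment really does persist, and that each call to a factory gets its own private copy, here is a small base-R sketch. counterFactory and bump are hypothetical names chosen for the illustration; only the pattern (functions returned in a list, sharing a hidden env) mirrors branchFactory.

```r
# Hypothetical counter factory; same shape as branchFactory, no XML needed.
counterFactory <- function() {
    env <- new.env(parent = emptyenv())
    env$n <- 0L

    bump <- function() {
        env$n <- env$n + 1L      # updates the env captured by this closure
        invisible(env$n)
    }

    get <- function() env
    list(bump = bump, get = get)
}

a <- counterFactory()
b <- counterFactory()
a$bump()
a$bump()
b$bump()

countA <- a$get()$n                        # state kept alive by instance a
countB <- b$get()$n                        # independent state of instance b
sameStore <- identical(a$get(), b$get())   # do a and b share one environment?
```

The two instances count independently because every call to counterFactory creates a fresh env, and that env stays alive for as long as the returned functions do.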

We're now ready to parse our file. Do this by creating an instance of the branch parser, with its own unique versions of the get and FrameSet functions and of the env variable created to store results. Then parse the file:

b <- branchFactory()
xx <- xmlEventParse("file.xml", handlers=list(), branches=b)

We can retrieve the results using b$get(), and can cast this to a list if that's convenient.

> as.list(b$get())
$`1sthalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                  [,3]                 
N "0"                   "1"                   "2"                  
T "2012-09-29T18:31:21" "2012-09-29T18:31:21" "2012-09-29T18:31:21"
X "-0.1158"             "-0.1146"             "-0.1134"            
Y "0.2347"              "0.2351"              "0.2356"             
S "1.27"                "1.3"                 "1.33"               

$`2ndhalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                      [,3]                     
N "0"                   "1"                       "2"                      
T "2012-09-29T18:31:21" "2012-09-29T18:31:21.196" "2012-09-29T18:31:21.243"
X "-0.1158"             "-0.1146"                 "-0.1134"                
Y "0.2347"              "0.2351"                  "0.2356"                 
S "1.27"                "1.3"                     "1.33"                   
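The code above stores every FrameSet, whereas the question asks for one specific GameSection and Object. One possible way to add that filter is to check the attributes inside the branch handler before storing anything. The sketch below is an assumption-laden variant, not part of the original answer: filteredBranchFactory and its two arguments are hypothetical names, and the temporary file is just a cut-down copy of the sample data standing in for the real 260 MB file.

```r
library(XML)

# Hypothetical variant of branchFactory that keeps only matching FrameSets.
filteredBranchFactory <- function(gameSection, object) {
    env <- new.env(parent = emptyenv())

    FrameSet <- function(elt) {
        attrs <- xmlAttrs(elt)
        # Keep the node only if both attributes equal the requested values;
        # a missing attribute simply fails the comparison.
        wanted <- identical(unname(attrs["GameSection"]), gameSection) &&
            identical(unname(attrs["Object"]), object)
        if (!wanted)
            return(invisible(NULL))
        id <- paste(attrs, collapse = ":")
        # Same extraction call as in the answer's FrameSet handler.
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }

    get <- function() env
    list(get = get, FrameSet = FrameSet)
}

# Small stand-in for the real file, written to a temporary location.
sampleFile <- tempfile(fileext = ".xml")
writeLines(c(
    '<mydocument>',
    '<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">',
    '<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '<Frame N="1" T="2012-09-29T18:31:21" X="-0.1146" Y="0.2351" S="1.3" />',
    '<Frame N="2" T="2012-09-29T18:31:21" X="-0.1134" Y="0.2356" S="1.33" />',
    '</FrameSet>',
    '<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />',
    '<Frame N="2" T="2012-09-29T18:31:21.243" X="-0.1134" Y="0.2356" S="1.33" />',
    '</FrameSet>',
    '</POSITIONS>',
    '</mydocument>'
), sampleFile)

fb <- filteredBranchFactory("2ndhalf", "00011D")
invisible(xmlEventParse(sampleFile, handlers = list(), branches = fb))
res <- as.list(fb$get())
```

Because non-matching FrameSet nodes are discarded as soon as they are seen, only the requested frames are kept in memory. If the combination of attributes is not unique in the full data set, the id would need an extra distinguishing component, as noted in the answer.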
