R: xmlEventParse with Large, Varying-node XML Input and Conversion to Data Frame


Question

I have ~100 XML files of publication data, each > 10 GB, formatted like this:

<?xml version="1.0" encoding="UTF-8"?> 
<records xmlns="http://website">
<REC rid="this is a test">
    <UID>ABCD123</UID>
    <data_1>
        <fullrecord_metadata>
            <references count="3">
                <reference>
                    <uid>ABCD2345</uid>
                </reference>
                <reference>
                    <uid>ABCD3456</uid>
                </reference>
                <reference>
                    <uid>ABCD4567</uid>
                </reference>
            </references>
        </fullrecord_metadata>
    </data_1>
</REC>
<REC rid="this is a test">
    <UID>XYZ0987</UID>
    <data_1>
        <fullrecord_metadata>
            <references count="N">
            </references>
        </fullrecord_metadata>
    </data_1>
</REC>
</records>

The number of references varies for each unique entry (indexed by UID), and some entries have none.

The goal: create one simple data.frame per XML file, as follows:

UID        reference
ABCD123    ABCD2345
ABCD123    ABCD3456
ABCD123    ABCD4567
XYZ0987    NULL

Because of the size of the files and the need to loop efficiently over many of them, I have been exploring xmlEventParse to limit memory usage. I can successfully extract the unique "UID" for each "REC" and create a data.frame using the following code, taken from prior questions:

library(XML)

branchFunction <- function() {
  store <- new.env()
  # called for each <UID> element; x is the parsed UID node
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//UID")
    key <- xmlValue(ns[[1]])
    value <- xmlValue(ns[[1]])
    print(value)
    store[[key]] <- value
  }
  getStore <- function() { as.list(store) }
  list(UID = func, getStore = getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
  file = "test.xml",
  handlers = NULL,
  branches = myfunctions
)

DF <- do.call(rbind.data.frame, myfunctions$getStore())

But I cannot successfully store the reference data, nor handle the varying number of references per UID. Thanks for any suggestions!

Recommended answer

Set up a function that creates a temporary storage area for our element data, along with a handler function that will be called every time a <REC> element is found.

library(XML)

uid_traverse <- function() {

  # we'll store them as character vectors and then make a data frame out of them.
  # this is likely one of the cheapest & fastest methods despite growing a vector
  # inch by inch. You can pre-allocate space and modify this idiom accordingly
  # for another speedup.

  uids <- c() 
  refs <- c()

  REC <- function(x) {

    uid <- xpathSApply(x, "//UID", xmlValue)
    ref <- xpathSApply(x, "//reference/uid", xmlValue)

    if (length(uid) > 0) {

      if (length(ref) == 0) {

        uids <<- c(uids, uid)
        refs <<- c(refs, NA_character_)

      } else {

        uids <<- c(uids, rep(uid, length(ref)))
        refs <<- c(refs, ref)

      } 

    } 

  }

  # we return a named list with the element handler and another
  # function that turns the vectors into a data frame

  list(
    REC = REC, 
    uid_df = function() { 
      data.frame(uid = uids, ref = refs, stringsAsFactors = FALSE)
    }
  )

}
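
The comments in uid_traverse() note that the vectors grow on every record and that the idiom can be modified for a speedup. One possible variation, sketched here in base R only, keeps one chunk per record in a list and flattens everything once at the end. The names make_collector, add, and as_df are hypothetical and not part of the original answer, and whether this is actually faster on a given data set is an assumption worth benchmarking.

```r
# Hypothetical helper (not from the original answer): store one chunk per
# record in a list and flatten the chunks only once, when the data frame
# is requested.
make_collector <- function() {

  uid_chunks <- list()  # repeated UIDs, one chunk per record
  ref_chunks <- list()  # referenced UIDs, one chunk per record
  n <- 0L               # number of records stored so far

  add <- function(uid, ref) {
    # a record without references still yields one row, with NA as the reference
    if (length(ref) == 0) ref <- NA_character_
    n <<- n + 1L
    uid_chunks[[n]] <<- rep(uid, length(ref))
    ref_chunks[[n]] <<- ref
    invisible(NULL)
  }

  as_df <- function() {
    data.frame(
      uid = as.character(unlist(uid_chunks)),
      ref = as.character(unlist(ref_chunks)),
      stringsAsFactors = FALSE
    )
  }

  list(add = add, as_df = as_df)
}

# Usage with the values from the sample XML
collector <- make_collector()
collector$add("ABCD123", c("ABCD2345", "ABCD3456", "ABCD4567"))
collector$add("XYZ0987", character(0))
result <- collector$as_df()
```

Inside a REC handler, add() would take the place of the two `<<-` assignments to uids and refs.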

We need one instance of uid_traverse():

uid_f <- uid_traverse()

Now we call xmlEventParse() and give it our handler, wrapped in invisible() because we don't need what xmlEventParse() returns, only its side effects:

invisible(
  xmlEventParse(
    file = path.expand("~/data/so.xml"),
    branches = uid_f["REC"]
  )
)

And we see the results:

uid_f$uid_df()
##       uid      ref
## 1 ABCD123 ABCD2345
## 2 ABCD123 ABCD3456
## 3 ABCD123 ABCD4567
## 4 XYZ0987     <NA>
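
Since the question involves roughly 100 files and one data.frame per file, the same idea can be wrapped so that every file gets a fresh set of storage vectors. The following is only a sketch under assumptions: parse_refs and parse_dir are hypothetical names, the handler repeats the logic of REC above, each REC is assumed to contain exactly one UID, and the directory layout and ".xml" file pattern are made up for illustration.

```r
library(XML)

# Hypothetical wrapper: parse one file with its own storage vectors, so that
# results from different files never mix.
parse_refs <- function(path) {
  uids <- character(0)
  refs <- character(0)

  # branch handler, called once per <REC> element (same logic as REC above)
  REC <- function(x) {
    uid <- xpathSApply(x, "//UID", xmlValue)
    ref <- xpathSApply(x, "//reference/uid", xmlValue)
    if (length(uid) > 0) {
      # records without references get a single NA row
      if (length(ref) == 0) ref <- NA_character_
      uids <<- c(uids, rep(uid[[1]], length(ref)))
      refs <<- c(refs, ref)
    }
  }

  invisible(xmlEventParse(file = path, branches = list(REC = REC)))
  data.frame(uid = uids, ref = refs, stringsAsFactors = FALSE)
}

# Hypothetical driver: one data frame per XML file in a directory,
# returned as a list named by file name.
parse_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.xml$", full.names = TRUE)
  out <- lapply(files, parse_refs)
  names(out) <- basename(files)
  out
}
```

A call such as parse_dir(path.expand("~/data")) would then return a named list with one data.frame per file. With files this large it may be preferable to write each data frame to disk as soon as it is produced instead of holding all of them in memory.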
