Write xml-object to disk

Question

I have a large number of XML files that I need to process. To do that, I want to read the files and save the resulting list of objects to disk. I tried saving the list with readr::write_rds, but after reading it back in, the object is modified and no longer valid. Is there anything I can do to work around this problem?

library(readr)
library(xml2)

x <- read_xml("<foo>
              <bar>text <baz id = 'a' /></bar>
              <bar>2</bar>
              <baz id = 'b' />
              </foo>")

# function to save and read object
roundtrip <- function(obj) {
  tf <- tempfile()
  on.exit(unlink(tf))

  write_rds(obj, tf)
  read_rds(tf)
}

list(x)
#> [[1]]
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
roundtrip(list(x))
#> [[1]]
#> {xml_document}

identical(x, roundtrip(x))
#> [1] FALSE
all.equal(x, roundtrip(x))
#> [1] TRUE
xml_children(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid
as_list(roundtrip(x))
#> Error in fun(x$node, ...): external pointer is not valid

Some context

I have around 500,000 XML files. To process them I planned on turning each one into a list with xml2::as_list, and I wrote code to extract what I need. Afterwards I realized that as_list is very expensive to run. I could either:

  1. re-write already carefully debugged code to parse data directly (xml_child, xml_text, ...), or
  2. use as_list.

To speed up option 2, I could run it on another machine with more cores, but I would like to pass a single file to that machine, because collecting and copying all the files is time-consuming.

Recommended answer

xml2 objects hold external pointers, which become invalid when you serialize them naively. The package provides the functions xml_serialize() and xml_unserialize() to handle this for you. Unfortunately the API is slightly cumbersome, because the underlying base::serialize() and base::unserialize() assume an open connection.


library(xml2)

x <- read_xml("<foo>
              <bar>text <baz id = 'a' /></bar>
              <bar>2</bar>
              <baz id = 'b' />
              </foo>")

# function to save and read object
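roundtrip <- function(obj) {
  tf <- tempfile()
  on.exit(unlink(tf))

  # xml_serialize() writes to an open binary connection
  con <- file(tf, "wb")
  xml_serialize(obj, con)
  close(con)

  # reopen for reading; after = FALSE closes it before the file is unlinked
  con <- file(tf, "rb")
  on.exit(close(con), add = TRUE, after = FALSE)
  xml_unserialize(con)
}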
x
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
(y <- roundtrip(x))
#> {xml_document}
#> <foo>
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>

identical(x, y)
#> [1] FALSE
all.equal(x, y)
#> [1] TRUE
xml_children(y)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
as_list(y)
#> $bar
#> $bar[[1]]
#> [1] "text "
#> 
#> $bar$baz
#> list()
#> attr(,"id")
#> [1] "a"
#> 
#> 
#> $bar
#> $bar[[1]]
#> [1] "2"
#> 
#> 
#> $baz
#> list()
#> attr(,"id")
#> [1] "b"
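
If the goal is a single file holding many documents, one possible alternative (not the xml_serialize() approach shown above) is to convert each document to its XML text, save the resulting character vector with saveRDS(), and re-parse on the other machine. The following is only a sketch under that assumption; the two small documents and the variable names are hypothetical stand-ins for the real data.

```r
library(xml2)

# Two small stand-in documents (hypothetical data).
docs <- list(
  read_xml("<foo><bar>1</bar></foo>"),
  read_xml("<foo><bar>2</bar><baz id='b'/></foo>")
)

# Plain character strings contain no external pointers,
# so they survive saveRDS()/readRDS() unchanged.
xml_strings <- vapply(docs, function(d) as.character(d), character(1))

rds_file <- tempfile(fileext = ".rds")
saveRDS(xml_strings, rds_file)

# On the other machine: read the strings back and re-parse each one.
restored <- lapply(readRDS(rds_file), read_xml)
unlink(rds_file)

# The restored objects are ordinary, usable xml_document objects.
xml_children(restored[[2]])
```

The trade-off is that every document has to be parsed again after loading, which is usually cheap compared with as_list().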

Regarding the second part of your question, I would seriously consider using XPath expressions to extract the desired data, even if that means rewriting code.
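
As a rough illustration of the XPath route, here is a minimal sketch against the sample document from the question. The variable names bar_text and baz_ids are made up for this example, and the real files will need their own expressions.

```r
library(xml2)

x <- read_xml("<foo>
              <bar>text <baz id = 'a' /></bar>
              <bar>2</bar>
              <baz id = 'b' />
              </foo>")

# Text content of every <bar> element directly under the root <foo>
bar_text <- xml_text(xml_find_all(x, "/foo/bar"))

# id attribute of every <baz> element, at any depth in the document
baz_ids <- xml_attr(xml_find_all(x, "//baz"), "id")
```

Both calls return plain character vectors, which can be saved with saveRDS() or write_rds() without the external-pointer problem, and this avoids the as_list() conversion entirely.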
