Parse XML Files (>1 megabyte) in R

Question

Currently I have ~20,000 XML files that range in size from a couple of KB to a few MB. Although it may not be ideal, I am using the "xmlTreeParse" function in the XML package to loop through each of the files and extract the text that I need and save the document as a csv file.

The code below works fine for files <1 MB in size:

library(XML)

files <- list.files()
for (i in files) {
    doc <- xmlTreeParse(i, useInternalNodes = TRUE)
    root <- xmlRoot(doc)

    name <- xmlValue(root[[8]][[1]][[1]]) # Name
    data <- xmlValue(root[[8]][[1]]) # Full text

    x <- data.frame(name = name)
    x$data <- data

    # paste0() avoids the space that paste() would insert before ".csv"
    write.csv(x, paste0(i, ".csv"), row.names=FALSE, na="")
}

The trouble is that any file >1 MB gives me the following error:

Excessive depth in document: 256 use XML_PARSE_HUGE option
Extra content at the end of the document
Error: 1: Excessive depth in document: 256 use XML_PARSE_HUGE option
2: Extra content at the end of the document

Please forgive my ignorance, however I have tried searching for the "XML_PARSE_HUGE" function in the XML package and can't seem to find it. Has anyone had any experience using this function? If so, I would greatly appreciate any advice as to how to get this code to handle slightly larger XML files.

Thanks!

Recommended answer

"XML_PARSE_HUGE" is a parser option rather than a function, so you select it through the options argument. XML:::parserOptions lists the available choices:

> XML:::parserOptions
   RECOVER      NOENT    DTDLOAD    DTDATTR   DTDVALID    NOERROR  NOWARNING 
         1          2          4          8         16         32         64 
  PEDANTIC   NOBLANKS       SAX1   XINCLUDE      NONET     NODICT    NSCLEAN 
       128        256        512       1024       2048       4096       8192 
   NOCDATA NOXINCNODE    COMPACT      OLD10  NOBASEFIX       HUGE     OLDSAX 
     16384      32768      65536     131072     262144     524288    1048576 

For example:

> HUGE
[1] 524288
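
As a quick check of what the option does, the sketch below parses a synthetic document nested 300 levels deep, first with HUGE alone and then with HUGE combined with NOBLANKS. The 300-level document is made up for illustration and is not one of the asker's files. The sketch also assumes the option names are visible once library(XML) is loaded, as the HUGE output above suggests.

```r
library(XML)

# Synthetic input (an assumption for illustration): one element nested
# 300 levels deep, beyond the depth of 256 named in the error message.
depth <- 300
xml_text <- paste0(strrep("<a>", depth), "x", strrep("</a>", depth))

# A single parser option.
doc1 <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE,
                     options = HUGE)

# Several parser options supplied together as one integer vector.
doc2 <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE,
                     options = c(HUGE, NOBLANKS))

# Inspect the results: the root element name and the innermost text.
root_name <- xmlName(xmlRoot(doc2))
inner_text <- xmlValue(xmlRoot(doc1))
```

Each option value is a distinct power of two, which is why several of them can be supplied together without clashing.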

It is sufficient to pass an integer vector containing any of these options. In your case:

xmlTreeParse(i, useInternalNodes = TRUE, options = HUGE)
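
Applied to the whole loop from the question, this might look like the sketch below. The helper names extract_record and convert_all are hypothetical, the positions [[8]][[1]] are copied from the question and depend on the layout of the asker's files, and the tryCatch wrapper is an added assumption so that one unparseable file does not stop a run over ~20,000 files.

```r
library(XML)

# Hypothetical helper: parse one file with the HUGE option and extract the
# two values the question pulls out. The node positions come from the
# question and only make sense for files with that layout.
extract_record <- function(path) {
  doc <- xmlTreeParse(path, useInternalNodes = TRUE, options = HUGE)
  root <- xmlRoot(doc)
  data.frame(
    name = xmlValue(root[[8]][[1]][[1]]),  # Name
    data = xmlValue(root[[8]][[1]]),       # Full text
    stringsAsFactors = FALSE
  )
}

# Hypothetical driver: convert every .xml file in a directory to a CSV
# written next to it, reporting and skipping files that fail.
convert_all <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.xml$", full.names = TRUE)
  for (f in files) {
    result <- tryCatch(extract_record(f), error = function(e) {
      message("Skipping ", f, ": ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) {
      write.csv(result, paste0(f, ".csv"), row.names = FALSE, na = "")
    }
  }
  invisible(files)
}
```

Calling convert_all() in the folder that holds the XML files writes one CSV per input file. Restricting list.files() to the .xml pattern also keeps the loop from trying to parse CSV files left over from an earlier run.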
