Parsing XML the SAX way in R


Question


Originating from this question, my research of the R (and other) documentation indicates that a SAX approach would be a faster way to parse XML data. Sadly, I couldn't find many working examples to help me understand how to get there.

Here's a dummy file with information that I want parsed. The real thing would have substantially more <ITEM> nodes and other nodes all around the tree that I would like to exclude. Another peculiarity is that the <META> section has two <DESC> elements, and I need any one of them (not both).

<FILE>
  <HEADER>
    <FILEID>12347</FILEID>
  </HEADER>
  <META>
    <DESC>
      <TYPE>A</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
    <DESC>
      <TYPE>B</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
  </META>
  <BODY>
    <ITEM>
      <IVALUE>1000</IVALUE>
      <ICODE>CDF</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>1500</IVALUE>
      <ICODE>EGK</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>300</IVALUE>
      <ICODE>TSR</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
  </BODY>
</FILE>

For the example XML above I'm looking to get

> data.table(fileid=12347, code="ABC", value=100000, ivalue=c(1000,1500,300), icode=c("CDF","EGK","TSR"), itype="R")
#    fileid code  value ivalue icode itype
# 1:  12347  ABC 100000   1000   CDF     R
# 2:  12347  ABC 100000   1500   EGK     R
# 3:  12347  ABC 100000    300   TSR     R

Could anyone with SAX experience guide me to building a parser to suit my needs with xmlEventParse()?

Solution

SAX (the Simple API for XML) might parse the XML data faster than some other approaches, but in general it will not give you better results than XPath, for example. Its real advantage is with bigger files: it does not load the complete tree into R, and so avoids potential memory problems.
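For comparison, here is a minimal sketch of the tree-based XPath route on the question's sample data. It is not part of the original answer: the inline `xml_text` string, the `grab` helper and the layout of `result` are illustrative assumptions, and it presumes the document is small enough to hold in memory.

```r
library(XML)

# Sample document from the question, inlined so the sketch is self-contained
xml_text <- paste0(
  "<FILE><HEADER><FILEID>12347</FILEID></HEADER>",
  "<META>",
  "<DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>",
  "<DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>",
  "</META>",
  "<BODY>",
  "<ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>",
  "<ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>",
  "<ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM>",
  "</BODY></FILE>"
)

# Builds the complete tree in memory (the cost that SAX avoids)
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: text of every node matching an XPath expression
grab <- function(path) xpathSApply(doc, path, xmlValue)

# Single values (FILEID, first DESC) are recycled across the ITEM rows
result <- data.frame(
  fileid = as.numeric(grab("/FILE/HEADER/FILEID")),
  code   = grab("/FILE/META/DESC[1]/CODE"),
  value  = as.numeric(grab("/FILE/META/DESC[1]/VALUE")),
  ivalue = as.numeric(grab("/FILE/BODY/ITEM/IVALUE")),
  icode  = grab("/FILE/BODY/ITEM/ICODE"),
  itype  = grab("/FILE/BODY/ITEM/ITYPE"),
  stringsAsFactors = FALSE
)
print(result)
```

The `DESC[1]` predicate selects only the first DESC element, which covers the "any one of them" requirement from the question.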

To use SAX, you can start from the code example below, which is based on the branches argument of xmlEventParse (one branch per kind of element you want to retrieve):

library(XML)

# a file to read with xmlEventParse
xmlDoc <- "example.xml"

# accumulators filled by the branch functions below
desc <- NULL
items <- NULL

# builds the branch functions to use with xmlEventParse
row.sax <- function() {

    # branch function for META 'DESC': receives each complete <DESC> node
    DESC <- function(node) {
        children <- xmlChildren(node)
        # drop the text nodes that sit between the child elements
        children[which(names(children) == "text")] <- NULL
        desc <<- rbind(desc, sapply(children, xmlValue))
    }

    # branch function for BODY 'ITEM': receives each complete <ITEM> node
    ITEM <- function(node) {
        children <- xmlChildren(node)
        children[which(names(children) == "text")] <- NULL
        items <<- rbind(items, sapply(children, xmlValue))
    }

    list(DESC = DESC, ITEM = ITEM)
}

# call xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

# process the result as a data.frame
desc <- as.data.frame(desc, stringsAsFactors = FALSE)
# keep only the first DESC and repeat it once per ITEM
desc <- desc[rep(1, nrow(items)), ]

items <- as.data.frame(items, stringsAsFactors = FALSE)

result <- cbind(desc, items)
row.names(result) <- seq_len(nrow(result))

Let me know if it works for you
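
If the node-building convenience of branches is not needed, the same data can also be collected with plain event callbacks, which is the classic SAX style. The following is a minimal sketch and not part of the original answer: `make_sax_collector` and its internal variables are hypothetical names, the sample XML is inlined and passed with `asText = TRUE`, and it assumes that FILEID, DESC and ITEM only appear where the sample shows them. It also picks up FILEID and keeps only the first DESC, as the question asked.

```r
library(XML)

# Sample document from the question, inlined so the sketch is self-contained
xml_text <- paste0(
  "<FILE><HEADER><FILEID>12347</FILEID></HEADER>",
  "<META>",
  "<DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>",
  "<DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>",
  "</META>",
  "<BODY>",
  "<ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>",
  "<ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>",
  "<ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM>",
  "</BODY></FILE>"
)

# Hypothetical helper: returns SAX callbacks plus an accessor for the data
make_sax_collector <- function() {
  path   <- character()    # stack of currently open element names
  buf    <- ""             # text collected for the element being read
  fileid <- NA_character_
  desc   <- NULL           # fields of the first <DESC> only
  cur    <- list()         # fields of the <DESC>/<ITEM> being read
  items  <- list()         # one list of fields per <ITEM>

  handlers <- list(
    startElement = function(name, attrs, ...) {
      path <<- c(path, name)
      buf  <<- ""
      if (name %in% c("DESC", "ITEM")) cur <<- list()
    },
    text = function(x, ...) {
      buf <<- paste0(buf, x)   # text may arrive in several chunks
    },
    endElement = function(name, ...) {
      parent <- if (length(path) > 1) path[length(path) - 1] else ""
      if (name == "FILEID") {
        fileid <<- trimws(buf)
      } else if (parent %in% c("DESC", "ITEM")) {
        cur[[name]] <<- trimws(buf)          # leaf field of a record
      } else if (name == "DESC" && is.null(desc)) {
        desc <<- cur                         # later DESC elements are ignored
      } else if (name == "ITEM") {
        items[[length(items) + 1]] <<- cur
      }
      path <<- path[-length(path)]
      buf  <<- ""
    }
  )

  # Assemble the collected values into the layout requested in the question
  result <- function() {
    field <- function(key) vapply(items, function(it) it[[key]], character(1))
    data.frame(
      fileid = as.numeric(fileid),
      code   = desc[["CODE"]],
      value  = as.numeric(desc[["VALUE"]]),
      ivalue = as.numeric(field("IVALUE")),
      icode  = field("ICODE"),
      itype  = field("ITYPE"),
      stringsAsFactors = FALSE
    )
  }

  list(handlers = handlers, result = result)
}

collector <- make_sax_collector()
invisible(xmlEventParse(xml_text, handlers = collector$handlers, asText = TRUE))
result <- collector$result()
print(result)
```

With this style no node objects are built at all, so memory use depends only on what the callbacks choose to keep. For a real file, pass its path as the first argument and drop `asText = TRUE`.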
