Combine values in huge XML-files

Question

I need to find and combine information in some huge XML files. Loading one with doc <- xmlInternalTreeParse(file.name, useInternalNodes=TRUE, trim=TRUE) makes my 16GB computer start swapping to disk before it finishes. I have followed the good instructions at http://www.omegahat.org/RSXML/Overview.html.

Adding to the example from there, this is more or less what my file looks like:

<?xml version="1.0" ?>
<TABLE>
  <SCHOOL>
    <NAME> School1 </NAME>
    <GRADES>
      <STUDENT> Fred </STUDENT>
      <TEST1> 66 </TEST1>
      <TEST2> 80 </TEST2>
      <FINAL> 70 </FINAL>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam1 </SOCCER>
      <HOCKEY> HockeyTeam1 </HOCKEY>
    </TEAMS>
  </SCHOOL>
  <SCHOOL>
    <NAME> School2 </NAME>
    <GRADES>
      <STUDENT> Wilma </STUDENT>
      <TEST1> 97 </TEST1>
      <TEST2> 91 </TEST2>
      <FINAL> 98 </FINAL>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam2 </SOCCER>
    </TEAMS>
  </SCHOOL>
</TABLE>

I need to list the students at each school that has a hockey team, together with the team name. The wanted output from the example is "Fred", "HockeyTeam1", "School1". The real files have thousands of schools, hockey teams and players.

How can I use xmlEventParse to parse the files and extract this information? I tried to extract all text fields from the files, but after hours of waiting there was still no output. Note: the real files are more deeply nested than this, so stepping down a fixed number of levels is not enough to find the information.

Recommended answer

We'll use the XML package

library(XML)

and create a closure that contains a function to handle the 'SCHOOL' node, as well as two helper functions to retrieve results when done. The SCHOOL function is invoked on each SCHOOL node. If the node has no hockey team, the function returns early. Otherwise it uses /SCHOOL/NAME/text() as a 'key', and /SCHOOL/TEAMS/HOCKEY/text() and //STUDENT/text() (or /SCHOOL/GRADES/STUDENT/text()) as values. A message is printed for every 1000 (by default) schools with hockey teams, so that there's some indication of progress. The 'get' function is used after the fact to retrieve the result as a data.frame; 'getres' returns the underlying environment.

teams <- function(progress=1000) {
    res <- new.env(parent=emptyenv())   # for results
    it <- 0L                            # counter -- schools with a hockey team seen so far
    list(SCHOOL=function(elt) {
        ## handle 'SCHOOL' nodes 
        if (getNodeSet(elt, "not(/SCHOOL/TEAMS/HOCKEY)"))
            ## early exit -- no hockey team
            return(NULL)
        it <<- it + 1L
        if (it %% progress == 0L)
            message(it)
        school <- getNodeSet(elt, "string(/SCHOOL/NAME/text())") # 'key'
        res[[school]] <-
            list(team=getNodeSet(elt,
                   "normalize-space(/SCHOOL/TEAMS/HOCKEY/text())"),
                 students= xpathSApply(elt, "//STUDENT", xmlValue))
    }, getres = function() {
        ## retrieve the 'res' environment when done
        res
    }, get=function() {
        ## retrieve 'res' environment as data.frame
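        school <- ls(res)                 # keys, sorted by ls()
        ## fetch values in the same order as 'school'; eapply() order is
        ## arbitrary and need not match the sorted keys
        info <- mget(school, envir=res)
        team <- unlist(lapply(info, "[[", "team"), use.names=FALSE)
        student <- lapply(info, "[[", "students")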
        len <- sapply(student, length)
        data.frame(school=rep(school, len), team=rep(team, len),
                   student=unlist(student, use.names=FALSE))
    })
}
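
The closure structure is easier to see without the XML handling. The following base-R sketch is only an illustration: make_collector, add and count are hypothetical names that do not appear in the answer. It shows how the functions in the returned list share the 'res' environment and a counter, and how a data.frame with one row per student can be built from the stored entries.

```r
## Hypothetical stand-in for teams(): same closure layout, no XML parsing.
make_collector <- function() {
    res <- new.env(parent = emptyenv())   # results, shared by the functions below
    n <- 0L                               # number of calls to add()
    list(add = function(school, team, students) {
        n <<- n + 1L                      # updates 'n' in the enclosing frame
        res[[school]] <- list(team = team, students = students)
        invisible(NULL)
    }, count = function() {
        n
    }, get = function() {
        school <- ls(res)                 # sorted keys
        info <- mget(school, envir = res) # values in the same order as 'school'
        student <- lapply(info, "[[", "students")
        team <- vapply(info, function(x) x$team, character(1),
                       USE.NAMES = FALSE)
        len <- lengths(student, use.names = FALSE)
        ## repeat school and team once per student
        data.frame(school = rep(school, len), team = rep(team, len),
                   student = unlist(student, use.names = FALSE),
                   stringsAsFactors = FALSE)
    })
}

collector <- make_collector()
collector$add("School1", "HockeyTeam1", c("Fred", "Barney"))
collector$add("School3", "HockeyTeam3", "Betty")
out <- collector$get()
```

Because 'res' is an environment, assignments made inside add() remain visible to get() afterwards; the counter 'n' needs <<- because it is an ordinary variable in the enclosing function.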

The functions returned by teams() are used as follows:

branches <- teams()
xmlEventParse("event.xml", handlers=NULL, branches=branches)
branches$get()
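
To try the branches mechanism on a file small enough to inspect, the sketch below writes a trimmed copy of the question's sample to a temporary file and parses it with a simplified handler. This is not the answer's code: hockey_rows and result are hypothetical names, and the relative XPath expressions (.//HOCKEY and so on) are an assumption made here so that elements are matched at any depth below SCHOOL, since the question says the real files are more deeply nested. Taking the first NAME match as the school name is likewise an assumption that may not hold for the real data.

```r
library(XML)

## Trimmed copy of the question's sample document.
xml_text <- '<?xml version="1.0" ?>
<TABLE>
  <SCHOOL>
    <NAME> School1 </NAME>
    <GRADES>
      <STUDENT> Fred </STUDENT>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam1 </SOCCER>
      <HOCKEY> HockeyTeam1 </HOCKEY>
    </TEAMS>
  </SCHOOL>
  <SCHOOL>
    <NAME> School2 </NAME>
    <GRADES>
      <STUDENT> Wilma </STUDENT>
    </GRADES>
    <TEAMS>
      <SOCCER> SoccerTeam2 </SOCCER>
    </TEAMS>
  </SCHOOL>
</TABLE>'
path <- tempfile(fileext = ".xml")
writeLines(xml_text, path)

## Hypothetical handler factory: one row per student at schools that
## contain a HOCKEY element anywhere below the SCHOOL node.
hockey_rows <- function() {
    rows <- list()
    list(SCHOOL = function(elt) {
        team <- xpathSApply(elt, ".//HOCKEY", xmlValue)
        student <- xpathSApply(elt, ".//STUDENT", xmlValue)
        if (length(team) == 0L || length(student) == 0L)
            return(NULL)                  # nothing to record for this school
        school <- xpathSApply(elt, ".//NAME", xmlValue)
        ## assumption: the first NAME and HOCKEY matches are the ones wanted
        rows[[length(rows) + 1L]] <<- data.frame(
            school = trimws(school[1]), team = trimws(team[1]),
            student = trimws(unlist(student)), stringsAsFactors = FALSE)
        NULL
    }, result = function() {
        do.call(rbind, rows)
    })
}

h <- hockey_rows()
invisible(xmlEventParse(path, handlers = NULL, branches = h))
out <- h$result()
```

Growing a list of small data.frames is convenient for a demonstration, but with thousands of schools the environment-based storage in teams() is likely the better choice.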
