Parsing large XML to dataframe in R

Problem description

I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done in macOS.

Each monthly XML is around 1 GB, has 150k records and 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later.

The XML files can be accessed here (scroll to the bottom for the monthly zips; when uncompressed, one should look at the "dming" XMLs).

I've made some progress, but processing the larger files takes too long (see below).

The XML looks like this:

<ROOT>
 <ROWSET_DUASDIA>
  <ROW_DUASDIA NUM="1">
   <variable1>value</variable1>
   ...
   <variable191>value</variable191>
  </ROW_DUASDIA>
  ...
  <ROW_DUASDIA NUM="150236">
   <variable1>value</variable1>
   ...
   <variable191>value</variable191>
  </ROW_DUASDIA>
 </ROWSET_DUASDIA>
</ROOT>

I hope that's clear enough. This is my first time working with XML.

I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (a daily XML instead of the monthly ones) and xml2. Here's what I did:

library(xml2)

raw <- read_xml(filename)

# Find all records
dua <- xml_find_all(raw, "//ROW_DUASDIA")

# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA, nrow = dualen, ncol = varlen))

# For loop to enter the data for each record in each row
for (j in 1:dualen) {
  df[j, ] <- xml_text(xml_children(dua[[j]]), trim = TRUE)
}

# Name columns
colnames(df) <- c(names(as_list(dua[[1]])))

I imagine that's fairly rudimentary, but I'm also pretty new to R.

Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records; in fact, I waited a couple of hours and it hadn't finished. Granted, I would only need to run this code once a month, but I'd like to improve it nonetheless.

I tried to turn the elements of all records into a list using the as_list function in xml2 so I could continue with plyr, but this also took too long.
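One way to avoid the 150k-iteration R-level loop is to extract the text of all children in a single vectorised pass and reshape it into a matrix. This is a sketch on a tiny inline sample with made-up values; it assumes every `ROW_DUASDIA` carries all variables in the same order, so records with missing variables would mis-align the columns:

```r
library(xml2)

# Tiny inline sample standing in for the real file (hypothetical values)
raw <- read_xml(
  '<ROOT>
    <ROWSET_DUASDIA>
      <ROW_DUASDIA NUM="1"><variable1>a</variable1><variable2>b</variable2></ROW_DUASDIA>
      <ROW_DUASDIA NUM="2"><variable1>c</variable1><variable2>d</variable2></ROW_DUASDIA>
    </ROWSET_DUASDIA>
  </ROOT>')

dua <- xml_find_all(raw, "//ROW_DUASDIA")

# xml2 functions are vectorised over nodesets: one pass instead of a per-record loop
vals <- xml_text(xml_children(dua), trim = TRUE)
varlen <- length(xml_children(dua[[1]]))

# Reshape row-wise; only valid when every record has the same children in order
df <- as.data.frame(matrix(vals, ncol = varlen, byrow = TRUE),
                    stringsAsFactors = FALSE)
colnames(df) <- xml_name(xml_children(dua[[1]]))
```

The speedup comes from replacing 150k R-level assignments with one C-level extraction, at the cost of the uniform-children assumption.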

Thank you.

Recommended answer

While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data-frame handler, xmlToDataFrame, for flat XML files like yours. Any node missing from a record but present in other siblings results in NA for the corresponding field.

library(XML)

doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
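As a quick illustration of that NA behaviour (a sketch with made-up values, assuming the XML package is installed), a record that lacks `variable2` simply gets NA in that column:

```r
library(XML)

# Inline sample: the second record is missing variable2
doc <- xmlParse(
  '<ROOT>
    <ROWSET_DUASDIA>
      <ROW_DUASDIA NUM="1"><variable1>a</variable1><variable2>b</variable2></ROW_DUASDIA>
      <ROW_DUASDIA NUM="2"><variable1>c</variable1></ROW_DUASDIA>
    </ROWSET_DUASDIA>
  </ROOT>', asText = TRUE)

df <- xmlToDataFrame(doc, nodes = getNodeSet(doc, "//ROW_DUASDIA"),
                     stringsAsFactors = FALSE)
# The second record lacks variable2, so that cell is NA
```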


You could even conceivably download the daily zips, unzip the needed XML, and parse it into a data frame, should the large monthly XMLs pose memory challenges. As an example, the code below extracts December 2018 daily data into a list of data frames to be row-bound at the end. The process even adds a DDate field. The method is wrapped in tryCatch due to missing days in the sequence or other URL or zip issues.

dec_urls <- paste0(1201:1231)      # MMDD strings "1201" ... "1231"
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"

xml_process <- function(dt) {      
  tryCatch({
    # DOWNLOAD ZIP TO URL
    url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt,".zip")
    file <- paste0(xml_folder, "/dding2018", dt, ".xml")

    download.file(url, temp_zip)
    unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
    unlink(temp_zip)           # DESTROY TEMP ZIP

    # PARSE XML TO DATA FRAME
    doc <- xmlParse(file)        
    df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
                    DDate = as.Date(paste0("2018", dt), format="%Y%m%d", origin="1970-01-01"))
    unlink(file)               # DESTROY TEMP XML

    # RETURN XML DF
    return(df)
  }, error = function(e) NA)      
}

# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)

# FILTER OUT NAs CAUGHT IN tryCatch (NROW(NA) is 1, so test for data frames instead)
dec_df_list <- Filter(is.data.frame, dec_df_list)

# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)
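If the daily frames end up with differing columns (say, a variable absent on some days), base `rbind` will error on the mismatch. One alternative (a sketch, assuming the data.table package is installed) is `data.table::rbindlist` with `fill = TRUE`, which pads missing columns with NA:

```r
library(data.table)

# Hypothetical stand-ins for two daily data frames with differing columns
df1 <- data.frame(variable1 = "a", variable2 = "b", stringsAsFactors = FALSE)
df2 <- data.frame(variable1 = "c", stringsAsFactors = FALSE)

# use.names matches columns by name; fill pads columns absent in some frames with NA
dec_final_df <- rbindlist(list(df1, df2), use.names = TRUE, fill = TRUE)
```

`rbindlist` is also typically much faster than `do.call(rbind, ...)` on long lists of frames.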
