Parse multiple XBRL files stored in a zip file

Question

I have downloaded multiple zip files from a website. Each zip file contains multiple files with html and xml extensions (about 100K files in each).

It is possible to extract the files manually and then parse them. However, I would like to be able to do this within R (if possible).

Example file (sorry, it is a bit big), using code from a previous question to download one zip file:

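library(XML)

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)

# Attributes of the first link whose text contains 'Accounts_Monthly_Data'
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)[[1]]

# Create temp/ and the cache directory temp/hmrcCache/
dir.create(file.path("temp", "hmrcCache"), recursive = TRUE)
# mode = "wb" downloads in binary mode so the zip is not corrupted on Windows
download.file(fileURLS, destfile = file.path("temp", myfiles), mode = "wb")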

I can parse the files using the XBRL package if I extract them manually. This can be done as follows:

library(XBRL)
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = TRUE)

I am struggling with how to extract these files from the zip archive and parse each one, say in a loop, using R and without extracting them manually. I tried making a start, but don't know how to progress from here. Thanks for any advice.

# Get names of files
lst <- unzip(file.path("temp", myfiles), list=TRUE)
dim(lst) # 118626

# unzip  and extract first file
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename=nms)

I am using Windows 8.1

R version 3.1.2 (2014-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Recommended answer

Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory and then parsed each file. I used the snow package to speed things up.

  library(snow)

  # Parse one zip file to start
  fls <- list.files("temp", pattern = "\\.zip$")[[1]]

  # Unzip into the session's temporary directory
  tmp <- tempdir()
  lst <- unzip(file.path("temp", fls), exdir = tmp)

  # Only parse first 10 records
  inst <- lst[1:10]

  # Start to parse - in parallel
  cl <- makeCluster(parallel::detectCores())
  # Load XBRL on every worker
  clusterCall(cl, function() library(XBRL))

  # Start
  st <- Sys.time()

  out <- parLapply(cl, inst, function(i)
                                  xbrlDoAll(i,
                                            cache.dir = "temp/hmrcCache",
                                            prefix.out = NULL, verbose = TRUE))

  stopCluster(cl)

  # Elapsed time
  Sys.time() - st
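
The code above unzips the whole archive before taking the first 10 paths. `unzip()` also accepts a `files` argument, so only the members about to be parsed need to be written to disk. A minimal sketch of that idea follows; `parse_first_members()` and its `parse_fn` argument are hypothetical names introduced here for illustration, not part of the original answer or of any package.

```r
# Hypothetical helper: extract only the first `n` members of `zipfile` into
# `exdir` and apply `parse_fn` to each extracted path.
# `parse_fn` stands in for the real parser (an assumption of this sketch).
parse_first_members <- function(zipfile, parse_fn, n = 10,
                                exdir = tempfile("xbrl_")) {
  # List the archive contents without extracting anything
  members <- unzip(zipfile, list = TRUE)$Name
  wanted <- head(members, n)

  # `files =` restricts extraction to the named members;
  # `exdir` is created by unzip() if it does not exist
  paths <- unzip(zipfile, files = wanted, exdir = exdir)

  parsed <- lapply(paths, function(p) {
    # Return the error object instead of stopping if one file fails
    tryCatch(parse_fn(p), error = function(e) e)
  })
  names(parsed) <- basename(paths)
  parsed
}
```

With the setup from the answer, `parse_fn` could be `function(p) xbrlDoAll(p, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = TRUE)`. Wrapping each call in `tryCatch()` is a design choice for long runs: one malformed file then shows up as an error object in the result list rather than aborting the whole job.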

(I am not sure that I am using tempdir() correctly, as this seems to save large amounts of data to the Local\Temp directory. I would welcome comments if I have approached this incorrectly, thanks.)
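
On the tempdir() point: `tempdir()` returns the per-session temporary directory, which R normally deletes only when the session ends, so anything extracted there stays on disk until then. One possible way to limit this is to extract into a dedicated subdirectory and remove it as soon as the work is done. The sketch below assumes a hypothetical helper, `with_scratch_dir()`, which is not part of the original answer or of any package.

```r
# Hypothetical helper: run `fn(dir)` with a fresh scratch directory and
# delete that directory (and everything in it) afterwards, even on error.
with_scratch_dir <- function(fn) {
  scratch <- tempfile(pattern = "xbrl_")   # unique path under tempdir()
  dir.create(scratch)
  on.exit(unlink(scratch, recursive = TRUE), add = TRUE)
  fn(scratch)
}

# Illustration with plain text files standing in for extracted XBRL files
n_files <- with_scratch_dir(function(dir) {
  writeLines("a", file.path(dir, "one.html"))
  writeLines("b", file.path(dir, "two.html"))
  length(list.files(dir))
})
```

In the answer's workflow, the `unzip(..., exdir = dir)` call and the parsing step would go inside the function passed to `with_scratch_dir()`, so the extracted files are gone once the parsed results have been returned.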
