Parse multiple XBRL files stored in a zip file
Question
I have downloaded multiple zip files from a website. Each zip file contains multiple files with html and xml extensions (roughly 100,000 in each archive).
It is possible to manually extract the files and then parse them. However, I would like to be able to do this within R, if possible.
Example file (sorry, it is a bit big), using code from a previous question to download one zip file:
library(XML)

# Scrape the Companies House monthly accounts page for zip-file links
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)[[1]]

# Create the download/cache directories and fetch one zip file
dir.create(file.path("temp", "hmrcCache"), recursive = TRUE)
download.file(fileURLS, destfile = file.path("temp", myfiles))
I can parse the files using the XBRL package if I manually extract them. This can be done as follows:
library(XBRL)

# Parse one manually extracted instance document
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = TRUE)
I am struggling with how to extract these files from the zip archive and parse each one, say, in a loop using R, without manually extracting them. I tried making a start, but don't know how to progress from here. Thanks for any advice.
# Get the names of the files inside the zip archive
lst <- unzip(file.path("temp", myfiles), list = TRUE)
dim(lst) # 118626 files

# Try to read the first file
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename = nms) # unz() returns a connection, not a file on disk
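Since unz() only opens a connection and xbrlDoAll() expects a real file path, one workaround is to extract each archive member to a scratch directory first and parse the extracted copy. Below is a minimal, self-contained sketch of that loop using small plain-text files in place of the real XBRL documents (all file names here are made up for illustration; readLines() stands in for xbrlDoAll()):

```r
# Build a demo zip containing two small files, then extract and
# process each member in a loop -- the same pattern works for the
# Companies House archives, with xbrlDoAll() in place of readLines().
tmp <- tempdir()
f1 <- file.path(tmp, "a.txt"); writeLines("hello", f1)
f2 <- file.path(tmp, "b.txt"); writeLines("world", f2)
zipfile <- file.path(tmp, "demo.zip")
zip(zipfile, files = c(f1, f2), flags = "-j")  # -j: store names without paths

# List the members without extracting anything
lst <- unzip(zipfile, list = TRUE)

# Extract one member at a time and process the extracted file
out <- lapply(lst$Name, function(nm) {
  path <- unzip(zipfile, files = nm, exdir = file.path(tmp, "out"))
  readLines(path)  # stand-in for xbrlDoAll(path, ...)
})
```

Note that this requires an external `zip` utility to be available (utils::zip shells out to it); unzip() itself uses R's built-in support.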
I am using Windows 8.1,
R version 3.1.2 (2014-10-31),
Platform: x86_64-w64-mingw32/x64 (64-bit).
Answer
Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory and then parsed each file. I used the snow package to speed things up.
library(XBRL)
library(snow)

# Parse one zip file to start
fls <- list.files("temp", pattern = "\\.zip$")[[1]]

# Unzip into a temporary directory
tmp <- tempdir()
lst <- unzip(file.path("temp", fls), exdir = tmp)

# Only parse the first 10 records
inst <- lst[1:10]

# Start to parse - in parallel, loading XBRL on each worker
cl <- makeCluster(parallel::detectCores())
clusterCall(cl, function() library(XBRL))

# Start
st <- Sys.time()
out <- parLapply(cl, inst, function(i)
          xbrlDoAll(i,
                    cache.dir = "temp/hmrcCache",
                    prefix.out = NULL, verbose = TRUE))
stopCluster(cl)
Sys.time() - st
(I am not sure that I am using tempdir() correctly, as this seems to save large amounts of data to the Local\Temp directory. I would welcome comments if I have approached this incorrectly, thanks.)
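On the tempdir() concern: the per-session temp directory is removed when the R session ends, but with ~100K files per archive it may be worth deleting the extracted copies explicitly after each zip has been parsed. A small sketch (the subdirectory name here is made up for illustration):

```r
# Extract each archive into its own subdirectory of the session temp
# directory, then delete that subdirectory once parsing is finished.
tmp <- file.path(tempdir(), "xbrl_extract")   # hypothetical scratch folder
dir.create(tmp, showWarnings = FALSE)

# ... lst <- unzip(zipfile, exdir = tmp); parse lst with xbrlDoAll() ...

unlink(tmp, recursive = TRUE)                 # free the disk space immediately
```

This keeps disk usage bounded to one archive's worth of extracted files at a time, rather than letting Local\Temp grow across the whole run.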