readxl::read_xls returns "libxls error: Unable to open file"
I have multiple .xls files (~100 MB each), and from each of them I would like to load multiple sheets into R as data frames. I have tried various functions, such as xlsx::read.xlsx2 and XLConnect::readWorksheetFromFile, both of which always run for a very long time (>15 minutes) without finishing, so I have to force-quit RStudio to keep working.
I also tried gdata::read.xls, which does finish, but it takes more than 3 minutes per sheet and cannot extract multiple sheets at once the way XLConnect::loadWorkbook can (which would be very helpful for speeding up my pipeline).
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places, I have seen the function readxl::read_xls recommended for this task; it is supposed to be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
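The same checks can be run over a whole batch of files before starting a pipeline. The following is only a minimal sketch, assuming the readxl package is installed; check_xls_files is a hypothetical helper name introduced here for illustration and is not part of readxl.

```r
# Hypothetical helper (not part of readxl): report, for each path, whether the
# file exists and which format readxl detects from the extension and signature.
check_xls_files <- function(paths) {
  exists <- file.exists(paths)
  ext <- rep(NA_character_, length(paths))
  sig <- rep(NA_character_, length(paths))
  # Only query readxl for files that are actually present
  for (i in which(exists)) {
    ext[i] <- readxl::format_from_ext(paths[i])
    sig[i] <- readxl::format_from_signature(paths[i])
  }
  data.frame(path = paths, exists = exists, ext = ext, signature = sig,
             stringsAsFactors = FALSE)
}

# Usage (assumes the .xls files sit in the working directory):
# check_xls_files(list.files(pattern = "\\.xls$"))
```

A file whose ext and signature columns disagree is worth inspecting by hand before handing it to read_xls.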
The test_file.xls used above is available here.
Any advice on making the first functions run faster, or on getting read_xls to run at all, would be appreciated. Thank you!
UPDATE:
It seems that some users are able to open the file above using the readxl::read_xls function while others are not, on both Mac and Windows, using the most up-to-date versions of R, RStudio, and readxl. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each Excel sheet in this way (for example, for the sheets "Overall" and "Area"):
install.packages("readxl")  # only needed once
library(readxl)
library(data.table)
# Read one sheet at a time and convert each result to a data.table
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
dt_area <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get a data.table like this (for example, only part of the dataset for the "Area" sheet):
You can just as well use the read_xls function instead of read_excel.
I checked: it also works correctly and is even a little faster, since read_excel is a wrapper over the read_xls and read_xlsx functions from the readxl package.
Also, you can use the excel_sheets function from the readxl package to list the sheet names of your Excel file, and then read every sheet in turn.
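As an illustration of that approach, here is a minimal sketch that reads every sheet into a named list. It assumes readxl is installed and that the file opens on your machine; read_all_sheets is a hypothetical helper name, not a readxl function.

```r
# Hypothetical helper (not part of readxl): read every sheet of a workbook
# into a list of tibbles named after the sheets.
read_all_sheets <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  sheets <- readxl::excel_sheets(path)  # character vector of sheet names
  out <- lapply(sheets, function(s) readxl::read_excel(path, sheet = s))
  names(out) <- sheets
  out
}

# Usage (assumes test_file.xls is in the working directory):
# all_sheets <- read_all_sheets("test_file.xls")
# dt_area <- data.table::as.data.table(all_sheets[["Area"]])
```

Each sheet is still parsed in a separate read_excel call, so this is a convenience for the pipeline rather than a speed-up.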
UPDATE
Benchmarking is done with the microbenchmark package for the following packages/functions: gdata::read.xls, XLConnect::readWorksheetFromFile and readxl::read_excel.
However, XLConnect is a Java-based solution, so it requires a lot of RAM.
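A comparison of this kind could be set up roughly as follows. This is only a sketch, assuming the microbenchmark, gdata, XLConnect and readxl packages are all installed (gdata additionally needs Perl, and XLConnect needs Java); benchmark_readers is a hypothetical wrapper name, and the small times value is an arbitrary choice to keep the run short on large files.

```r
# Hypothetical wrapper: time the three readers on the same sheet of one file.
benchmark_readers <- function(path, sheet = 1, times = 3L) {
  if (!file.exists(path)) stop("File not found: ", path)
  microbenchmark::microbenchmark(
    gdata     = gdata::read.xls(path, sheet = sheet),
    XLConnect = XLConnect::readWorksheetFromFile(path, sheet = sheet),
    readxl    = readxl::read_excel(path, sheet = sheet),
    times = times  # number of repetitions per expression
  )
}

# Usage (assumes test_file.xls is in the working directory):
# print(benchmark_readers("test_file.xls", sheet = "Area"))
```

Timings depend heavily on the machine and on the memory available to the JVM, so results from one setup may not carry over to another.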