Fastest way to read large Excel xlsx files? To parallelize or not?
Question
My questions are:

- What is the fastest way to read large(ish) .xlsx Excel files into R? 10 to 200 MB xlsx files, with multiple sheets.
- Can some kind of parallel processing be used, e.g. each core reading a separate sheet of a multi-sheet Excel file?
- Is there any other kind of optimisation that can be performed?
What I understand so far (and what I don't):

- If reading from spinning disks, as I will be, parallel processing may actually slow down the reading, as multiple processes try to read from the same file. However, parallel processing may help with things like converting and inferring data types. I'm not sure how much time readxl spends reading from disk (which I assume is IO bound) vs converting data types (which I guess is CPU bound).
- This may be different with SSD drives. I might copy the data to an SSD drive and read from there if there's a massive improvement.
- data.table::fread speeds up the reading of text files (although I don't fully understand why), but it cannot be used for Excel files - or can it?
- I understand from this answer that readxl tends to be faster than openxlsx.
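On the CPU-bound side, readxl exposes two knobs for the type-guessing step: `guess_max` (how many rows are scanned to infer each column's type) and `col_types` (skip guessing entirely). A minimal sketch, using a throwaway workbook created with the writexl package (an assumption; any xlsx file works):

```r
library(readxl)

# Build a small test workbook (writexl is assumed to be installed).
tmp <- tempfile(fileext = ".xlsx")
writexl::write_xlsx(list(Sheet1 = data.frame(a = 1:26, b = letters)), tmp)

# readxl guesses column types from the first guess_max rows (default 1000);
# lowering it trims the guessing cost on very tall sheets:
fast <- read_excel(tmp, sheet = "Sheet1", guess_max = 10)

# Alternatively, skip guessing entirely by fixing the types up front:
typed <- read_excel(tmp, sheet = "Sheet1", col_types = c("numeric", "text"))
```

Whether this saves noticeable time depends on how tall the sheets are; it is worth benchmarking on your own files.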
I am interested in tabular data only; I am not interested in the Excel formatting, nor in charts, text labels or any other kind of data.

I am possibly looking to import into tidyverse tibbles, but not necessarily. I will then need to export the tables into a Microsoft SQL Server.
Some background: I mostly use Python and am totally new to R. Reading large Excel files in Python is painfully slow. I have already seen that R's readxl is much faster than Python's pandas (on a 15-sheet xlsx, each sheet with 10,000 rows and 32 columns: 5.6 seconds for readxl vs 33 seconds for pandas), so that's great! I would, however, still like to understand whether there is any way to make the import even faster. I can read the files with R, export them to SQL, then continue the rest of my workflow in Python, reading from SQL.

I don't think converting to CSV is the best option, especially not when readxl is so much faster than pandas anyway; basically, converting to csv may easily take longer than the time I'd save by reading from csv rather than Excel. Plus, at least with Python (I don't really know enough R to have tested this thoroughly with readxl), inferring data types works much better with xlsx than with csv.
My code (any critique or suggestion is more than welcome):
library(readxl)
library(tidyverse)
library(tictoc)

# Note: parent.frame(2)$ofile is only set when this script is source()d;
# it is NULL in an interactive session.
this.dir <- dirname(parent.frame(2)$ofile)
setwd(this.dir)

tic("readxl")
path <- "myfile.xlsx"
sheetnames <- excel_sheets(path)
mylist <- lapply(sheetnames, read_excel, path = path)  # one data frame per sheet
names(mylist) <- sheetnames
toc()
Answer
You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:
library(parallel)
library(readxl)
excel_path <- ""
sheets <- excel_sheets(excel_path)
Make a cluster with a specified number of cores:
cl <- makeCluster(detectCores() - 1)
Use parLapplyLB to go through the Excel sheets and read them in parallel with load balancing:
parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
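One caveat: parLapplyLB returns an unnamed list in sheet order, so the sheet names are lost; setNames() reattaches them. In the sketch below, toupper stands in for the real worker (readxl::read_excel) so the snippet runs without a workbook:

```r
library(parallel)

cl <- makeCluster(2)
sheets <- c("Sheet1", "Sheet2")

# Reattach the sheet names to the (unnamed) result list; toupper is a
# stand-in for readxl::read_excel(excel_path, sheet = sheet).
result <- setNames(parLapplyLB(cl, sheets, toupper), sheets)

stopCluster(cl)
```

With the names restored, the result is interchangeable with the `lapply` version from the question (`mylist`).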
You can use the microbenchmark package to test how fast certain options are:
library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)
In my case, the parallel version is faster:
Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10
The test file consists of 6 sheets, each containing this table:
   test test1 test3 test4 test5
1     1     1     1     1     1
2     2     2     2     2     2
3     3     3     3     3     3
4     4     4     4     4     4
5     5     5     5     5     5
6     6     6     6     6     6
7     7     7     7     7     7
8     8     8     8     8     8
9     9     9     9     9     9
10   10    10    10    10    10
11   11    11    11    11    11
12   12    12    12    12    12
13   13    13    13    13    13
14   14    14    14    14    14
15   15    15    15    15    15
Note: you can use stopCluster(cl) to shut down the workers when the processing is finished.
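A slightly safer pattern is to wrap the cluster in a helper so the workers are shut down even if reading a sheet throws an error; a minimal sketch (the `with_cluster` helper is illustrative, not part of the parallel package):

```r
library(parallel)

# Run fun(cl) on a fresh cluster; on.exit() guarantees cleanup on error.
with_cluster <- function(n_cores, fun) {
  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl), add = TRUE)
  fun(cl)
}

# Trivial demonstration in place of the sheet-reading workload:
squares <- with_cluster(2, function(cl) parLapplyLB(cl, 1:4, function(x) x^2))
```

The same wrapper would take the sheet-reading closure from the answer as `fun`.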