Fastest way to read large Excel xlsx files? To parallelize or not?


Question

My questions are:

  • What is the fastest way to read large(ish) .xlsx Excel files into R? 10 to 200 MB xlsx files, with multiple sheets.

  • Can some kind of parallel processing be used, e.g. each core reading a separate sheet of a multi-sheet Excel file?

  • Is there any other kind of optimisation that can be performed?

What I understand so far (and what I don't):

  • If reading from spinning disks, as I will be, parallel processing may actually slow down the reading, as multiple processes try to read from the same file. However, might parallel processes help with things like converting and inferring data types? I'm not sure how much time readxl spends reading from disk (which I assume is IO-bound) vs converting data types (which I guess is CPU-bound).
  • This may be different with SSD drives. I might copy the data to an SSD drive and read from there if there's a massive improvement.
  • data.table::fread speeds up the reading of text files (although I don't fully understand why), but it cannot be used for excel files - or can it?
  • I understand from this answer that readxl tends to be faster than openxlsx (a quick way to check that on a given file is sketched just below this list).
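
Since the relative speed of the two packages can vary with file size and content, here is a minimal sketch for benchmarking readxl against openxlsx on your own data (not from the original post; "myfile.xlsx" is a placeholder path):

library(readxl)
library(openxlsx)
library(microbenchmark)

path <- "myfile.xlsx"  # placeholder: any xlsx file of realistic size

# Compare reading the first sheet with each package
microbenchmark(
  readxl   = readxl::read_excel(path, sheet = 1),
  openxlsx = openxlsx::read.xlsx(path, sheet = 1),
  times = 5
)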

I am interested in tabular data only; I am not interested in the Excel formatting, nor in charts, text labels or any other kind of data.

I am possibly looking to import into tidyverse tibbles, but not necessarily. I will then need to export the tables into a Microsoft SQL Server.
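
For the SQL Server step, a minimal sketch using the DBI and odbc packages (the driver name, server and database are placeholder assumptions; mylist is the sheet-keyed list built in the code further down):

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",  # assumption: this driver is installed
                 Server   = "myserver",                       # placeholder
                 Database = "mydb",                           # placeholder
                 Trusted_Connection = "Yes")

# Write each sheet (a data frame) to a table named after the sheet
for (nm in names(mylist)) {
  dbWriteTable(con, nm, as.data.frame(mylist[[nm]]), overwrite = TRUE)
}
dbDisconnect(con)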

Some background: I mostly use Python and am totally new to R. Reading large Excel files in Python is painfully slow. I have already seen that R's readxl is much faster than Python's pandas (on a 15-sheet xlsx, each sheet with 10,000 rows and 32 columns: 5.6 seconds for readxl vs 33 seconds for pandas), so that's great! I would, however, still like to understand if there is any way to make the import even faster. I can read the files with R, export them to SQL, then continue the rest of my workflow with Python reading from SQL.

I don't think converting to CSV is the best option, especially not when readxl is so much faster than Python anyway; basically converting to csv may easily take longer than the time I'd save by reading from csv rather than excel. Plus, at least with Python (I don't really know enough R to have tested this thoroughly with readxl), inferring data types works much better with xlsx than with csv.

My code (any critique or suggestion is more than welcome):

library(readxl)
library(tidyverse)
library(tictoc)


# Note: parent.frame(2)$ofile is only set when this script is run via source();
# this line will fail in an interactive session
this.dir <- dirname(parent.frame(2)$ofile)
setwd(this.dir)

tic("readxl")

path <- "myfile.xlsx"
sheetnames <- excel_sheets(path)
mylist <- lapply(sheetnames, read_excel, path = path)  # reuse sheetnames rather than calling excel_sheets() twice

names(mylist) <- sheetnames
toc()

Answer

You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:

library(parallel)
library(readxl)

excel_path <- ""  # fill in the path to your multi-sheet xlsx file
sheets <- excel_sheets(excel_path)

Make a cluster with a specified number of cores:

cl <- makeCluster(detectCores() - 1)
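
The worker function below namespaces its call as readxl::read_excel, so the workers don't need library(readxl); if you prefer unqualified calls inside the worker, you could load the package on each worker first:

# optional: load readxl on every worker in the cluster
clusterEvalQ(cl, library(readxl))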

Use parLapplyLB to go through the excel sheets and read them in parallel using load balancing:

parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
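
To keep the sheet names on the result, as the question's code does, you can capture the return value and name it (a small addition, not part of the original answer):

mylist <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
names(mylist) <- sheets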

You can use the package microbenchmark to test how fast certain options are:

library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

In my case, the parallel version is faster:

Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10

The test file consists of 6 sheets, each containing this table:

    test test1 test3 test4 test5
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
11    11    11    11    11    11
12    12    12    12    12    12
13    13    13    13    13    13
14    14    14    14    14    14
15    15    15    15    15    15

Note: you can use stopCluster(cl) to shut down the workers when the process is finished.
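
For completeness, the teardown matching the makeCluster() call above:

stopCluster(cl)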
