Fastest way to read large Excel xlsx files? To parallelize or not?


Question

My questions are:

  • What is the fastest way to read large(ish) .xlsx Excel files into R? 10 to 200 MB xlsx files, with multiple sheets.

Can some kind of parallel processing be used, e.g. each core reading a separate sheet of a multi-sheet Excel file?

Is there any other kind of optimisation that can be performed?

What I understand so far (and what I haven't yet):

  • if reading from spinning disks, as I will, parallel processing may actually slow down the reading as multiple processes try to read from the same file. However, parallel processing may help with things like converting and inferring data types? Not sure how much time readxl spends reading from disk (which I assume is IO bound) vs converting data types (which I guess is CPU bound).
  • This may be different with SSD drives. I might copy data to an SSD drive and read from there if there's a massive improvement.
  • data.table::fread speeds up the reading of text files (although I don't fully understand why) but it cannot be used for excel files - or can it?
  • I understand from this answer that readxl tends to be faster than openxlsx (see the benchmark sketch after this list).
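
To check that last claim on your own files, a minimal benchmark sketch (the file name is a placeholder, and only the first sheet is timed here):

library(microbenchmark)
library(readxl)
library(openxlsx)

path <- "myfile.xlsx"  # placeholder: any multi-sheet xlsx file

microbenchmark(
  readxl   = read_excel(path, sheet = 1),
  openxlsx = read.xlsx(path, sheet = 1),
  times = 5
)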

I am interested in tabular data only; I am not interested in the Excel formatting, nor in charts, text labels or any other kind of data.

I am possibly looking to import into tidyverse tibbles, but not necessarily. I will then need to export the tables into a Microsoft SQL Server.
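
For the SQL Server step, a minimal sketch using the DBI and odbc packages (the driver, server, database, and connection options below are placeholders, and mylist is assumed to be a named list of data frames as produced by the code further down):

library(DBI)
library(odbc)

con <- dbConnect(odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",  # depends on installed drivers
                 Server   = "myserver",                       # placeholder
                 Database = "mydatabase",                     # placeholder
                 Trusted_Connection = "Yes")                  # or UID = / PWD =

# Write each sheet to its own table, named after the sheet
for (sheet in names(mylist)) {
  dbWriteTable(con, sheet, mylist[[sheet]], overwrite = TRUE)
}

dbDisconnect(con)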

Some background: I mostly use Python and am totally new to R. Reading large Excel files in Python is painfully slow. I have already seen that R's readxl is much faster than Python's pandas (on a 15-sheet xlsx, each sheet with 10,000 rows and 32 columns: 5.6 seconds for readxl vs 33 seconds for pandas), so that's great! I would, however, still like to understand if there is any way to make the import even faster. I can read the files with R, export them to SQL, then continue the rest of my workflow with Python reading from SQL.

I don't think converting to CSV is the best option, especially not when readxl is so much faster than Python anyway; basically converting to csv may easily take longer than the time I'd save by reading from csv rather than excel. Plus, at least with Python (I don't really know enough R to have tested this thoroughly with readxl), inferring data types works much better with xlsx than with csv.
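
On type inference: one optimisation I have not benchmarked (so this is an assumption) is to skip readxl's type guessing entirely by declaring the column types up front, when they are known. A sketch with hypothetical types:

library(readxl)

# Hypothetical column types; readxl otherwise guesses them by sampling
# the first guess_max rows of each sheet
df <- read_excel("myfile.xlsx", sheet = 1,
                 col_types = c("text", "numeric", "date", "numeric"))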

My code (any critique or suggestion is more than welcome):

library(readxl)
library(tidyverse)
library(tictoc)


# Set the working directory to this script's folder
# (parent.frame(2)$ofile is only populated when the script is source()d, e.g. from RStudio)
this.dir <- dirname(parent.frame(2)$ofile)
setwd(this.dir)

tic("readxl")

path <- "myfile.xlsx"
sheetnames <- excel_sheets(path)
# Reuse the sheet names already fetched instead of calling excel_sheets() twice
mylist <- lapply(sheetnames, read_excel, path = path)

names(mylist) <- sheetnames
toc()

Answer

You could try to run it in parallel using the parallel package, but it is a bit hard to estimate how fast it will be without sample data:

library(parallel)
library(readxl)

excel_path <- ""
sheets <- excel_sheets(excel_path)

Make a cluster with a specified number of cores:

cl <- makeCluster(detectCores() - 1)

Use parLapplyLB to go through the excel sheets and read them in parallel using load balancing:

parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
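
The result comes back as an unnamed list in sheet order; to key it by sheet name like the sequential version in the question (sheet_list is just a name I picked):

sheet_list <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
# Key the list by sheet name, mirroring names(mylist) <- sheetnames above
names(sheet_list) <- sheets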

You can use the package microbenchmark to test how fast certain options are:

library(microbenchmark)

microbenchmark(
  lapply = {lapply(sheets, function(sheet) {
    read_excel(excel_path, sheet = sheet)
  })},
  parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)},
  times = 10
)

In my case, the parallel version is faster:

Unit: milliseconds
     expr       min        lq     mean    median        uq      max neval
   lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890    10
 parallel  58.94018  64.96452 118.5969  71.42688  80.48588 316.9914    10

The test file consists of 6 sheets, each containing this table:

    test test1 test3 test4 test5
 1     1     1     1     1     1
 2     2     2     2     2     2
 3     3     3     3     3     3
 4     4     4     4     4     4
 5     5     5     5     5     5
 6     6     6     6     6     6
 7     7     7     7     7     7
 8     8     8     8     8     8
 9     9     9     9     9     9
10    10    10    10    10    10
11    11    11    11    11    11
12    12    12    12    12    12
13    13    13    13    13    13
14    14    14    14    14    14
15    15    15    15    15    15

Note: you can use stopCluster(cl) to shut down the workers when the process is finished.
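
Putting it together, a hedged end-to-end sketch (the function name and defaults are mine, not part of the original answer) that creates the cluster, reads and names all sheets, and shuts the workers down even if something fails:

library(parallel)
library(readxl)

read_sheets_parallel <- function(excel_path, n_cores = detectCores() - 1) {
  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl))  # guarantee cleanup even on error
  sheets <- excel_sheets(excel_path)
  result <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)
  names(result) <- sheets
  result
}

mylist <- read_sheets_parallel("myfile.xlsx")  # placeholder file name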
