How to speed up import of large xlsx files?

Problem description

I want to process a large 200MB Excel (xlsx) file with 15 sheets and 1 million rows (with 5 columns each) and create a pandas dataframe from the data. The import of the Excel file is extremely slow (up to 10 minutes). Unfortunately, the Excel import file format is mandatory (I know that csv is faster...).

How can I speed up the process of importing a large Excel file into a pandas dataframe? It would be great to get the time down to around 1-2 minutes, if possible, which would be much more bearable.

What I have tried so far:

Option 1: pandas I/O read_excel

%%timeit -r 1
import pandas as pd
import datetime

xlsx_file = pd.ExcelFile("Data.xlsx")
list_sheets = []

for sheet in xlsx_file.sheet_names:
    list_sheets.append(xlsx_file.parse(sheet, header = 0, dtype={
        "Sales": float,
        "Client": str, 
        "Location": str, 
        "Country": str, 
        "Date": datetime.datetime
        }).fillna(0))

output_dataframe = pd.concat(list_sheets)

10min 44s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Option 2: Dask

%%timeit -r 1
import pandas as pd
import dask
import dask.dataframe as dd
from dask.delayed import delayed

excel_file = "Data.xlsx"

parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0)
output_dataframe = dd.from_delayed(parts)

10min 12s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
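
Note that the snippet above wraps only the first sheet (sheet_name=0) in a single delayed task, so nothing is actually parsed in parallel. Below is a minimal sketch (not from the original post) of one delayed pd.read_excel call per worksheet, assuming the same Data.xlsx file; the process scheduler is used because the XML parsing is CPU-bound Python code that threads cannot speed up.

import dask
import dask.dataframe as dd
import pandas as pd
from dask.delayed import delayed

excel_file = "Data.xlsx"
sheet_names = pd.ExcelFile(excel_file).sheet_names

# One delayed task per worksheet; dd.from_delayed builds a single dask dataframe
# whose partitions are the individual sheets.
parts = [delayed(pd.read_excel)(excel_file, sheet_name=name) for name in sheet_names]
output_dataframe = dd.from_delayed(parts)

# Trigger the reads with one process per task so the sheets parse concurrently.
result = output_dataframe.compute(scheduler="processes")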

Option 3: openpyxl and csv

Just creating the separate csv files from the Excel workbook took around 10 minutes, before even importing the csv files into a pandas dataframe via read_csv

%%timeit -r 1
import openpyxl
import csv

from openpyxl import load_workbook
wb = load_workbook(filename = "Data.xlsx", read_only=True)

list_ws = wb.sheetnames
nws = len(wb.sheetnames) #number of worksheets in workbook

# create separate csv files from each worksheet (15 in total)
for i in range(0, nws):
    ws = wb[list_ws[i]]
    with open("output/%s.csv" %(list_ws[i].replace(" ","")), "w", newline="") as f:
        c = csv.writer(f)
        for r in ws.rows:
            c.writerow([cell.value for cell in r])

9min 31s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
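
For completeness, here is a short sketch (not part of the timing above) of the second half of this approach: reading the generated CSV files back into pandas with read_csv and concatenating them. The output/ directory and the column names/dtypes are taken from the snippets above and are assumptions about the data layout.

import glob
import pandas as pd

# Read every CSV written by the openpyxl export above and stack them into one
# dataframe; the dtypes mirror the read_excel call from option 1.
frames = [
    pd.read_csv(
        path,
        dtype={"Sales": float, "Client": str, "Location": str, "Country": str},
        parse_dates=["Date"],
    )
    for path in glob.glob("output/*.csv")
]
output_dataframe = pd.concat(frames, ignore_index=True).fillna(0)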

I use Python 3.7.3 (64-bit) on a single machine (Windows 10) with 16GB RAM and 8 cores (i7-8650U CPU @ 1.90GHz). I run the code within my IDE (Visual Studio Code).

Answer

The compression isn't the bottleneck; the problem is parsing the XML and creating new data structures in Python. Judging from the speeds you're quoting, I'm assuming these are very large files: see the note on performance in the documentation for more details. Both xlrd and openpyxl are running close to the limits of the underlying Python and C libraries.

Starting with openpyxl 2.6 you do have the values_only option when reading cells, which will speed things up a bit. You can also use multiple processes with read-only mode to read worksheets in parallel, which should speed things up if you have multiple processors.
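
A minimal sketch combining both suggestions, assuming the workbook name and the five columns from the question: each worker process opens the workbook in read-only mode and streams its sheet with values_only=True, and the per-sheet frames are concatenated at the end. The helper name and pool size are illustrative, not part of the answer.

import multiprocessing as mp

import pandas as pd
from openpyxl import load_workbook

COLUMNS = ["Sales", "Client", "Location", "Country", "Date"]

def read_sheet(sheet_name):
    # read_only streams rows lazily; values_only returns plain values instead of
    # building Cell objects, which is where much of the parsing time goes.
    wb = load_workbook(filename="Data.xlsx", read_only=True)
    ws = wb[sheet_name]
    rows = ws.iter_rows(min_row=2, values_only=True)  # skip the header row
    frame = pd.DataFrame(list(rows), columns=COLUMNS)
    wb.close()
    return frame

if __name__ == "__main__":
    sheet_names = load_workbook(filename="Data.xlsx", read_only=True).sheetnames
    # One process per worksheet, capped at the number of cores.
    with mp.Pool(processes=min(mp.cpu_count(), len(sheet_names))) as pool:
        frames = pool.map(read_sheet, sheet_names)
    output_dataframe = pd.concat(frames, ignore_index=True).fillna(0)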
