Reading Excel file is magnitudes slower using openpyxl compared to xlrd


Problem description

I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet will contain around 250,000 rows across around 50 columns. I have tested both using openpyxl and xlrd using nearly identical code.

Here's the code I'm using (minus debugging statements):

import xlrd
import openpyxl

def UseXlrd(file_name):
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0,col))
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row,col), str):
                record[first_row[col]] = worksheet.cell_value(row,col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row,col)
        data.append(record)
    return data


def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    first_row = []
    for col in range(1,sheet.max_column+1):
        first_row.append(sheet.cell(row=1,column=col).value)
    data = []
    for r in range(2,sheet.max_row+1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r,column=col+1).value, str):
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r,column=col+1).value
        data.append(record)
    return data

xlrd_results = UseXlrd('foo.xlsx')
openpyxl_results = UseOpenpyxl('foo.xlsx')

Passing the same Excel file containing 3500 rows gives drastically different run times. Using xlrd I can read the entire file into a list of dictionaries in under 2 seconds. Using openpyxl I get the following results:

Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds

While I can use xlrd in the final script, I will have to hard-code a lot of formatting because of various type issues (e.g. ints read as floats, dates read as ints, datetimes read as floats). Since I need to reuse this code for a few more imports, it doesn't make sense to hard-code specific columns to format them properly and then maintain similar code across 4 different scripts.
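One way to avoid repeating those fix-ups per column is a single converter keyed on the cell type codes xlrd reports (`XL_CELL_NUMBER` is 2, `XL_CELL_DATE` is 3). The sketch below is illustrative, not part of either library: `convert_cell` is a hypothetical helper, and the date arithmetic assumes the default 1900 date system (`datemode=0`) for serials past February 1900, which is what `xlrd.xldate.xldate_as_datetime(value, book.datemode)` implements in the real library.

```python
from datetime import datetime, timedelta

# xlrd's cell type codes (xlrd.XL_CELL_NUMBER, xlrd.XL_CELL_DATE)
XL_CELL_NUMBER = 2
XL_CELL_DATE = 3

def convert_cell(ctype, value):
    """Normalize a raw xlrd cell value into a natural Python type.

    Hypothetical helper; assumes the workbook uses the default
    1900 date system (datemode=0).
    """
    if ctype == XL_CELL_DATE:
        # Excel serial dates count days from the 1900 epoch;
        # for serials after Feb 1900 that works out to 1899-12-30.
        return datetime(1899, 12, 30) + timedelta(days=value)
    if ctype == XL_CELL_NUMBER and value == int(value):
        return int(value)  # undo xlrd's float-for-int behavior
    return value

print(convert_cell(XL_CELL_DATE, 43831.0))  # 2020-01-01 00:00:00
print(convert_cell(XL_CELL_NUMBER, 42.0))   # 42
```

In `UseXlrd`, the `record[first_row[col]] = ...` line could then call `convert_cell(worksheet.cell_type(row, col), worksheet.cell_value(row, col))` once, instead of per-column formatting scattered across 4 scripts.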

Any advice on how to proceed?

Answer

You can iterate over the worksheet rows instead of addressing each cell by coordinates. In read-only mode, every random `sheet.cell()` call forces openpyxl to re-parse the worksheet, so cell-by-cell access gets slower the further down you read; a single pass over `sheet.rows` reads each row once:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data

This should scale to large files. You may want to chunk your result if the list `data` gets too large.

Now the openpyxl version takes about twice as long as the xlrd one:

%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop

Note that xlrd and openpyxl might interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the outputs comparable:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = float(cell.value)
        data.append(record)
    return data

Now, both versions give the same results for my test data:

>>> xlrd_results == openpyxl_results
True
