使用openpyxl和大数据的内存错误表现出色 [英] Memory error using openpyxl and large data excels

查看:93
本文介绍了使用openpyxl和大数据的内存错误表现出色的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我编写了一个脚本,该脚本必须从一个文件夹(大约10,000个)中读取很多excel文件.此脚本加载excel文件(其中一些具有2,000多行),并读取一列以计算行数(检查内容).如果行数不等于给定数,它将在日志中写入警告.

I have written a script which has to read lot of excel files from a folder (around 10,000). This script loads the excel file (some of them has more than 2,000 rows) and reads one column to count the number of rows (checking stuff). If the number of rows is not equal to a given number, it writes the warning in a log.

当脚本读取1000个以上的excel文件时,就会出现问题.然后是它引发内存错误的时候,我不知道问题可能出在哪里.以前,该脚本读取两个具有14,000行的csv文件并将其存储在列表中.这些列表包含excel文件及其各自的行数的标识符.如果此行数不等于excel文件的行数,则会写入警告.读取这些列表可能是问题吗?

The problem comes when the script reads more than 1,000 excel files. It's then when it throws memory error, and I don't know where could be the problem. Previously, the script read two csv file with 14,000 rows and stores it in a list. These lists contain an identificator for the excel file and its respective number of rows. If this number of rows is not equal to the number of rows of the excel file, it writes the warning. Could be the problem the reading of these lists?

我正在使用openpyxl加载工作簿,在打开下一个工作簿之前是否需要关闭它们?

I'm using openpyxl to load the workbooks, do I need to close them before open the next?

这是我的代码:

# -*- coding: utf-8 -*-

import os
from openpyxl import Workbook
import glob
import time
import csv
from time import gmtime,strftime
from openpyxl import load_workbook

folder = ''
conditions = 0
a = 0
flight_error = 0
condition_error = 0
typical_flight_error = 0
SP_error = 0


cond_numbers = []
with open('Conditions.csv','rb') as csv_name:           # Abre el fichero csv donde estarán las equivalencias   
    csv_read = csv.reader(csv_name,delimiter='\t')

    for reads in csv_read:
        cond_numbers.append(reads)

flight_TF = []
with open('vuelo-TF.csv','rb') as vuelo_TF:
    csv_read = csv.reader(vuelo_TF,delimiter=';')

    for reads in csv_read:
        flight_TF.append(reads)


excel_files = glob.glob('*.xlsx')

for excel in excel_files:
    print "Leyendo excel: "+excel

    wb = load_workbook(excel)
    ws = wb.get_sheet_by_name('Control System')
    flight = ws.cell('A7').value
    typical_flight = ws.cell('B7').value
    a = 0

    for row in range(6,ws.get_highest_row()):
        conditions = conditions + 1


        value_flight = int(ws.cell(row=row,column=0).value)
        value_TF = ws.cell(row=row,column=1).value
        value_SP = int(ws.cell(row=row,column=4).value)

        if value_flight == '':
            break

        if value_flight != flight:
            flight_error = 1                # Si no todos los flight numbers dentro del vuelo son iguales

        if value_TF != typical_flight:
            typical_flight_error = 2            # Si no todos los typical flight dentro del vuelo son iguales

        if value_SP != 100:
            SP_error = 1



    for cond in cond_numbers:
        if int(flight) == int(cond[0]):
            conds = int(cond[1])
            if conds != int(conditions):
                condition_error = 1         # Si el número de condiciones no se corresponde con el esperado

    for vuelo_TF in flight_TF:
        if int(vuelo_TF[0]) == int(flight):
            TF = vuelo_TF[1]
            if typical_flight != TF:
                typical_flight_error = 1        # Si el vuelo no coincide con el respectivo typical flight

    if flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+':  Los flight numbers del vuelo '+str(flight)+' no coinciden.\n'
        log.write(message)
        log.close()
        flight_error = 0

    if condition_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n'
        log.write(message)
        log.close()
        condition_error = 0

    if typical_flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n'
        log.write(message)
        log.close() 
        typical_flight_error = 0

    if typical_flight_error == 2:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0

    if SP_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n'
        log.write(message)
        log.close()
        SP_error = 0

    conditions = 0

结尾处的if语句用于检查和写入警告日志.

The if statements of the end are for checking and writing warning logs.

我正在使用带有8 GB RAM的Windows XP和Intel Xeon W3505(两个内核,2.53 GHz).

I'm using windows xp with 8 gb RAM and intel xeon w3505 (two cores, 2,53 GHz).

推荐答案

openpyxl的默认实现会将所有访问的单元存储到内存中.我建议您使用优化阅读器(链接- https://openpyxl.readthedocs. org/en/latest/optimized.html )代替

The default implementation of openpyxl will store all the accessed cells into memory. I will suggest you to use the Optimized reader (link - https://openpyxl.readthedocs.org/en/latest/optimized.html) instead

在代码中:-

wb = load_workbook(file_path, use_iterators = True)

在加载工作簿传递use_iterators = True时.然后访问工作表和单元格,如:

While loading a workbook pass use_iterators = True. Then access the sheet and cells like:

for row in sheet.iter_rows():
    for cell in row:
        cell_text = cell.value

这会将内存占用减少到5-10%

This will reduce the memory footprint to 5-10%

更新:在2.4.0版中,use_iterators = True选项已删除.在较新的版本中,引入了openpyxl.writer.write_only.WriteOnlyWorksheet来转储大量数据.

UPDATE: In version 2.4.0 use_iterators = True option is removed. In newer versions openpyxl.writer.write_only.WriteOnlyWorksheet is introduced for dumping large amounts of data.

from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()

# now we'll fill it with 100 rows x 200 columns
for irow in range(100):
    ws.append(['%d' % i for i in range(200)])

# save the file
wb.save('new_big_file.xlsx') 

未经测试的以下代码只是从上面的链接复制而来.

Not tested the below code just copied from the above link.

感谢@SdaliM提供信息.

Thanks @SdaliM for the information.

这篇关于使用openpyxl和大数据的内存错误表现出色的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆