Openpyxl optimizing cells search speed
Problem description
I need to search an Excel sheet for cells containing some pattern, and it takes more time than I can afford. The most optimized code I could write is below. Since the data patterns usually occur row after row, I use iter_rows(row_offset=x). Unfortunately, the code below takes longer and longer to find the given pattern on each pass of the for loop (starting from milliseconds and getting up to almost a minute). What am I doing wrong?
import openpyxl
import datetime
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "test_sheet"

print("Generating quite big excel file")
for i in range(1, 10000):
    for j in range(1, 20):
        ws.cell(row=i, column=j).value = "Cell[{},{}]".format(i, j)

print("Saving test excel file")
wb.save('test.xlsx')

def FindXlCell(search_str, last_r):
    t = datetime.datetime.utcnow()
    for row in ws.iter_rows(row_offset=last_r):
        for cell in row:
            if search_str == cell.value:
                print(search_str, last_r, cell.row, datetime.datetime.utcnow() - t)
                last_r = cell.row
                return last_r
    print("record not found ", search_str, datetime.datetime.utcnow() - t)
    return 1

wb = openpyxl.load_workbook("test.xlsx", data_only=True)
t = datetime.datetime.utcnow()
ws = wb["test_sheet"]
last_row = 1

print("Parsing excel file in a loop for 3 cells")
for i in range(1, 100, 1):
    last_row = FindXlCell("Cell[0,0]", last_row)
    last_row = FindXlCell("Cell[1000,6]", last_row)
    last_row = FindXlCell("Cell[6000,6]", last_row)
Recommended answer
Looping over a worksheet multiple times is inefficient. The search appears to get progressively slower because more and more memory is used on each loop. This is because last_row = FindXlCell("Cell[0,0]", last_row) means that the next search will create new cells at the end of the rows: openpyxl creates cells on demand, because rows can be technically empty while the cells in them are still addressable. At the end of your script the worksheet has a total of 598000 rows, but you always start searching from A1.
If you wish to search a large file for text multiple times, it would probably make sense to build a lookup table (a dictionary) keyed by the cell text, with the cell coordinates as the value. For example:
matrix = {}
for row in ws:
    for cell in row:
        matrix[cell.value] = (cell.row, cell.col_idx)
In a real-world example you'd probably want to use a defaultdict so that multiple cells with the same text can be handled.
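As a rough illustration of the defaultdict idea, the sketch below collects every coordinate for each text value in a single pass. The helper name build_index and the FakeCell stand-in are hypothetical and exist only so the snippet runs without a workbook; with openpyxl you would presumably pass the worksheet's rows (for example ws.iter_rows()) instead.

```python
from collections import defaultdict, namedtuple


def build_index(rows):
    """Map each cell value to a list of (row, column) coordinates."""
    index = defaultdict(list)
    for row in rows:
        for cell in row:
            # Skip empty cells so None does not become a key
            if cell.value is not None:
                index[cell.value].append((cell.row, cell.col_idx))
    return index


# Hypothetical stand-in exposing the same attributes the answer's snippet uses
FakeCell = namedtuple("FakeCell", "row col_idx value")

rows = [
    [FakeCell(1, 1, "alpha"), FakeCell(1, 2, "beta")],
    [FakeCell(2, 1, "beta"), FakeCell(2, 2, None)],
]

index = build_index(rows)
print(index["beta"])             # every location holding "beta"
print(index.get("missing", []))  # .get avoids creating an empty entry
```

Once the index is built, each search is a dictionary lookup rather than another pass over the sheet.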
This could be combined with read-only mode for a minimal memory footprint, unless, of course, you want to edit the file.
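A minimal sketch of combining the lookup table with read-only mode might look like the following. It assumes a reasonably recent openpyxl (one where iter_rows accepts values_only), builds a tiny workbook in memory instead of the question's test.xlsx, and derives coordinates by counting rows and columns from 1, which is an assumption that holds when iteration starts at A1.

```python
import io
from collections import defaultdict

from openpyxl import Workbook, load_workbook

# Build a small workbook in memory so the sketch is self-contained
wb = Workbook()
ws = wb.active
ws.title = "test_sheet"
for r in range(1, 4):
    for c in range(1, 3):
        ws.cell(row=r, column=c, value="Cell[{},{}]".format(r, c))

buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)

# Reopen in read-only mode: rows are read lazily instead of all at once
wb_ro = load_workbook(buffer, read_only=True, data_only=True)
ws_ro = wb_ro["test_sheet"]

# Single pass over the sheet, recording where each value occurs
index = defaultdict(list)
for r, row in enumerate(ws_ro.iter_rows(values_only=True), start=1):
    for c, value in enumerate(row, start=1):
        if value is not None:
            index[value].append((r, c))

wb_ro.close()  # read-only workbooks keep the file open until closed

print(index["Cell[2,1]"])
```

The workbook is only read once here; all later searches go through the in-memory index.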