Fastest Way To Run Through 50k Lines of Excel File in OpenPYXL


Problem description

I'm using openpyxl in Python, and I'm trying to run through 50k lines, grab the data from each row, and place it into a file. However, what I'm finding is that it runs incredibly slowly the farther I get into it. The first 1k lines go super fast, less than a minute, but after that it takes longer and longer to do the next 1k lines.

I was opening a .xlsx file. I wonder if it would be faster to open a .txt file as a CSV, or to read a JSON file, or to convert the data somehow into a format that reads faster?
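A minimal sketch of what the CSV route might look like, assuming the sheet were exported to a hypothetical values.csv; this only uses Python's standard csv module and is not part of the current code:

import csv

# Hypothetical export of the same sheet to values.csv; column B becomes index 1.
with open('values.csv') as f:
    for row in csv.reader(f):
        name = row[1]
        # ...same per-row processing as in the loop further down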

I have 20 unique values in a given column, and the values attached to each of them are random. For each unique value, I'm trying to build a single string of all the values that belong to it.

Value1: 1243,345,34,124, Value2: 1243,345,34,124, etc, etc

I'm running through the value list and checking whether a file with that name exists. If it does, the code accesses that file and appends the new value to it; if it doesn't, it creates the file and then sets it to append. I keep a dictionary that holds all of the "append write file" handles, so any time I want to write something it grabs the file name, looks up the append handle in the dict, and writes to that file, so it doesn't keep opening new files every time it runs.

The first 1k took less than a minute. Now I'm on the 4k-to-5k records and it has already been running for 5 minutes. It seems to take longer as the record count goes up, and I wonder how to speed it up. It's not printing to the console at all.

import os
import datetime

writeFile = 1
theDict = {}  # maps a name to its already-open "append" file handle
counter = 1   # tracks which row of column B we're on

for row in ws.iter_rows(rowRange):
    for cell in row:
        # grab the value from column B for the current counter position
        theStringValueLocation = "B" + str(counter)
        theValue = ws[theStringValueLocation].value
        theName = cell.value
        textfilename = theName + ".txt"

        if os.path.isfile(textfilename):
            # file already exists: write to the handle stored in the dict
            listToAddTo = theDict[theName]
            listToAddTo.write("," + theValue)
            if counter == 1000:
                print("1000")
                st = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

        else:
            # first time this name is seen: create the file, then keep an append handle
            writeFileName = open(textfilename, 'w')
            writeFileName.write(theValue)
            writeFileName = open(textfilename, 'a')
            theDict[theName] = writeFileName
        counter = counter + 1

I added some timestamps to the code above (they're not shown there), and you can see the output below. The problem I'm seeing is that each 1k run takes longer and longer: 2 minutes the first time, then 3 minutes, then 5 minutes, then 7 minutes. By the time it hits 50k, I'm worried it's going to take an hour or more, which is far too long.

1000
2016-02-25 15:15:08
2000
2016-02-25 15:17:07
3000
2016-02-25 15:20:52
4000
2016-02-25 15:25:28
5000
2016-02-25 15:32:00
6000
2016-02-25 15:40:02
7000
2016-02-25 15:51:34
8000
2016-02-25 16:03:29
9000
2016-02-25 16:18:52
10000
2016-02-25 16:35:30

Something I should make clear: I don't know the names of the values ahead of time. Maybe I should run through and grab those in a separate Python script first to make this go faster?

Second, I need a string of all the values separated by commas, which is why I put them into a text file to grab later. I was thinking of doing it with a list, as was suggested to me, but I wonder whether that would have the same problem. I think the problem has to do with reading from Excel. As long as I can get a comma-separated string out of it, I'm happy to do it another way.
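For example, if the values for one name were collected in an in-memory list while reading, the comma-separated string could be built at the end (a small sketch using the example data above):

values = [1243, 345, 34, 124]                 # values collected for one name while reading
combined = ','.join(str(v) for v in values)   # "1243,345,34,124"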

Or maybe I could use try/except instead of checking for the file every time, and if there's an error, assume I need to create a new file? Maybe looking up whether the file exists on every single row is what's making it so slow?
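A sketch of that idea: instead of calling os.path.isfile() on every row, catch the missing dictionary key and only open the file the first time a name appears (theDict, theName, and theValue as in the code above):

try:
    outfile = theDict[theName]              # reuse the handle if this name was seen before
    outfile.write("," + theValue)
except KeyError:
    outfile = open(theName + ".txt", 'a')   # first time: open once in append mode
    theDict[theName] = outfile
    outfile.write(theValue)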

This question is a continuation of my original question here, where I took some of the suggestions: What is the fastest performance tuple for large data sets in python?

Answer

I think what you're trying to do is get a key out of column B of the row and use it as the filename to append to. Let's speed it up a lot:

from collections import defaultdict

Value_entries = defaultdict(list)  # maps each key to a list of its row values

for row in ws.iter_rows(rowRange):
    key = row[1].value  # column B of the row is the key
    # collect every cell value for this key; str() so join() also works on numbers
    Value_entries[key].extend([str(cell.value) for cell in row])

# All done. Now write one file per key, with its values comma-separated:
for key in Value_entries.keys():
    with open(key + '.txt', 'w') as f:
        f.write(','.join(Value_entries[key]))
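For completeness, a minimal sketch of how ws might be obtained before running that loop; the filename data.xlsx is a placeholder, and read_only=True is an extra openpyxl option (not part of the answer above) that usually helps when a large sheet is only being read:

from openpyxl import load_workbook

wb = load_workbook('data.xlsx', read_only=True)  # read_only streams rows instead of loading everything
ws = wb.active                                   # or wb['Sheet1'] if the sheet name is known

for row in ws.iter_rows():                       # walks every populated row in order
    pass                                         # plug in the defaultdict logic shown above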
