Python: performance issues with islice


Question


With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.

import csv
import itertools
from collections import deque
import time

my_queue = deque()

start_row = 500004
stop_row = start_row + 50000

with open('test.csv', newline='') as fin:
    # load into csv's reader
    csv_f = csv.reader(fin)

    # start logging time for performance
    start = time.time()

    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4]) * float(row[10]))

    # stop logging time
    end = time.time()
    # display performance
    print("Initial queue populating time: %.2f" % (end - start))


Answer



For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s


That is islice being intelligent. Or lazy, depending on which term you prefer.


Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).


Line 4? Finding line 4 is fast, your computer just needs to find 3 \n. Line 500004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some other sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.
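That linear scan can be sketched as a plain loop over the file. This is a minimal illustration of the idea, not the actual islice implementation; the helper name `skip_to_line` is made up for the example:

```python
import io

def skip_to_line(f, n):
    """Advance file object f so the next readline() returns line n (0-indexed).

    Every byte before line n must be read to count the newlines, which is
    why the cost grows linearly with n -- essentially what islice has to
    do when it skips lines of a file.
    """
    for _ in range(n):
        if not f.readline():  # hit EOF before reaching line n
            break
    return f

# Demo on a small in-memory file
f = io.StringIO("a\nb\nc\nd\n")
skip_to_line(f, 2)
print(f.readline())  # -> c
```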


As for what you can do about it: Try to be smart when trying to grab lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)

In Python:

lines_I_want = [(start1, stop1), (start2, stop2), ...]  # sorted, non-overlapping
with open(filename) as f:
    for i, line in enumerate(f):
        if i >= lines_I_want[0][0]:
            if i >= lines_I_want[0][1]:
                lines_I_want.pop(0)
                if not lines_I_want:  # list is empty
                    break
            else:
                # line is a line I want. Do something


And if you have any control over making that file, make every line the same length so you can seek. Or use a database.
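With fixed-length lines, jumping to any row becomes simple arithmetic plus one seek, with no scanning at all. A minimal sketch of that idea; the record width and the `read_line_at` helper are assumptions for the example, not part of the original question:

```python
import io

RECORD_WIDTH = 8  # bytes per record, newline included (an assumed layout)

def read_line_at(f, n):
    """Jump straight to record n of a fixed-width binary file.

    Because every record occupies exactly RECORD_WIDTH bytes, the byte
    offset of record n is just n * RECORD_WIDTH -- a single seek.
    """
    f.seek(n * RECORD_WIDTH)
    return f.readline()

# Demo: three 8-byte records ("value-N" is 7 bytes + newline)
data = io.BytesIO(b"value-0\nvalue-1\nvalue-2\n")
print(read_line_at(data, 2))  # -> b'value-2\n'
```

The same seek works on a real file opened in binary mode (`open(filename, 'rb')`), which is exactly the file.seek shortcut the answer refers to.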

