Processing a large .txt file in python efficiently

Problem description

I am quite new to python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines with python. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (I've had it running now for about 40 hours). The math is quite simple, so I don't think it should be taking this long.

The way I am reading my .txt file right now is with the csv.DictReader module. My code is as follows:

import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'r')
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")

I believe that this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.

Since I am only interested in running my calculation over "chunks" or "windows" of data at a time, is there a more efficient way to read in only specified lines at a time, perform the calculation and then repeat with a new specified "chunk" or "window" of specified lines?

Answer

A collections.deque is an ordered collection of items which can be given a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" of your csv, you just need to keep appending rows to the deque and it will discard the oldest rows automatically.
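The maxlen behavior can be seen in a tiny self-contained sketch (a capacity of 3 is chosen here purely for illustration):

```python
import collections

# a deque capped at 3 items: appending a 4th silently drops the oldest
dq = collections.deque(maxlen=3)
for row in [1, 2, 3, 4, 5]:
    dq.append(row)

print(list(dq))  # [3, 4, 5] -- only the last three items remain
```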

import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

    # initial fill: load the first window of 50,000 rows
    for _ in range(50000):
        dq.append(next(reader))

    # repeatedly compute, then slide the window forward by 10,000 rows
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        compute(dq)
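The same pattern can be exercised end-to-end on synthetic data. In this sketch the column name "value", the placeholder compute (a mean), and the tiny window/step sizes (5 and 2) are all illustrative assumptions, not values from the question:

```python
import collections
import csv
import io

# synthetic tab-delimited data standing in for the real file:
# a single "value" column holding the integers 0..11
data = "value\n" + "\n".join(str(i) for i in range(12))

def compute(window):
    # placeholder calculation: mean of the "value" column over the window
    return sum(int(row["value"]) for row in window) / len(window)

dq = collections.deque(maxlen=5)
reader = csv.DictReader(io.StringIO(data), delimiter="\t")

results = []
try:
    for _ in range(5):                # initial fill: first 5 rows
        dq.append(next(reader))
    while True:
        results.append(compute(dq))  # report this window
        for _ in range(2):           # slide the window forward by 2 rows
            dq.append(next(reader))
except StopIteration:
    results.append(compute(dq))      # final (partial-slide) window

print(results)  # [2.0, 4.0, 6.0, 8.0, 9.0]
```

Note that the last window may have advanced by fewer than the full step when the file runs out, which is why the final compute happens inside the except block.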
