Processing Large Files in Python [1000 GB or More]


Problem Description

Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.

Is there any faster way to do this than the one I am using below? How long would it take to complete the task?

phrase = "how fast it is"
count = 0
with open('bigfile.txt') as f:
    for line in f:
        count += line.count(phrase)

If I am right, and I do not have this file in memory, I would need to wait until the PC loads the file from disk each time I do the search. That should take at least 4000 seconds for a 250 MB/sec hard drive and a 1000 GB file.
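As a quick back-of-the-envelope check of that estimate (this snippet is not part of the original question; it simply assumes the scan is limited by sequential read speed):

file_size_gb = 1000      # size of the text file
read_speed_mb = 250      # assumed sustained sequential read speed, MB/s

# time to stream the whole file once from disk, in seconds
seconds = file_size_gb * 1024.0 / read_speed_mb
print("one full scan: about %.0f seconds (%.1f hours)" % (seconds, seconds / 3600))
# -> about 4096 seconds, i.e. a little over an hour per search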

Recommended Answer

I used file.read() to read the data in chunks; in the examples below the chunks were 100 MB, 500 MB, 1 GB and 2 GB respectively. The size of my text file is 2.1 GB.

Code:

from functools import partial

def read_in_chunks(size_in_bytes):
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'r+b') as f:
        prev = ''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, ''):
            if not text.endswith('\n'):
                # the chunk ends mid-line: split off the trailing partial
                # line so it is not counted until the next chunk arrives.
                text, rest = text.rsplit('\n', 1)
                # prepend the previous partial line, if any.
                text = prev + text
                prev = rest
            else:
                # the chunk ends on a line boundary: simply prepend the
                # previous partial line.
                text = prev + text
                prev = ''
            count += text.count(s)
        count += prev.count(s)
        print count  # Python 2 print statement

Timings:

read_in_chunks(104857600)   # 100 MB chunks
$ time python so.py
10000000

real    0m1.649s
user    0m0.977s
sys     0m0.669s

read_in_chunks(524288000)   # 500 MB chunks
$ time python so.py
10000000

real    0m1.558s
user    0m0.893s
sys     0m0.646s

read_in_chunks(1073741824)  # 1 GB chunks
$ time python so.py
10000000

real    0m1.242s
user    0m0.689s
sys     0m0.549s


read_in_chunks(2147483648)  # 2 GB chunks
$ time python so.py
10000000

real    0m0.844s
user    0m0.415s
sys     0m0.408s

On the other hand, the simple loop version takes around 6 seconds on my system:

def simple_loop():
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt') as f:
        print sum(line.count(s) for line in f)

$ time python so.py
10000000

real    0m5.993s
user    0m5.679s
sys     0m0.313s


Results of @SlaterTyranus's grep version on my file:

$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt|wc -l
10000000

real    0m11.975s
user    0m11.779s
sys     0m0.568s


Results of @woot's solution:

$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000

real    0m5.955s
user    0m14.825s
sys     0m5.766s

Got the best timing when I used 100 MB as the block size:

$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000

real    0m4.632s
user    0m13.466s
sys     0m3.290s


Results of woot's second solution:

$ time python woot_thread.py # CHUNK_SIZE = 1073741824
10000000

real    0m1.006s
user    0m0.509s
sys     0m2.171s
$ time python woot_thread.py  # CHUNK_SIZE = 2147483648
10000000

real    0m1.009s
user    0m0.495s
sys     0m2.144s
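The woot_thread.py script itself is not reproduced in the answer. Purely as an illustration of the general idea of overlapping disk reads with counting (this is an assumption about the approach, not the actual script), a reader thread feeding a small queue could look like this:

import threading
from queue import Queue  # Python 3

CHUNK_SIZE = 1073741824  # 1 GB, matching one of the timing runs above
PHRASE = b'Lets say i have a text file of 1000 GB'

def _reader(f, q):
    # producer: stream fixed-size chunks from disk into the queue
    while True:
        chunk = f.read(CHUNK_SIZE)
        q.put(chunk)
        if not chunk:          # empty bytes object signals end of file
            break

def count_phrase(path):
    q = Queue(maxsize=2)       # small buffer keeps reads ahead of counting
    count = 0
    prev = b''
    with open(path, 'rb') as f:
        t = threading.Thread(target=_reader, args=(f, q))
        t.start()
        while True:
            chunk = q.get()
            if not chunk:
                break
            data = prev + chunk
            count += data.count(PHRASE)
            # keep a tail shorter than the phrase so matches spanning
            # chunk boundaries are still found, without double counting
            prev = data[-(len(PHRASE) - 1):]
        t.join()
    return count

print(count_phrase('data.txt'))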

System specs: Core i5-4670, 7200 RPM HDD
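The chunked reader shown above is written for Python 2 (str chunks, print statement). For Python 3 the same approach needs bytes throughout; a minimal, untimed sketch of that adaptation:

from functools import partial

def read_in_chunks_py3(size_in_bytes, path='data.txt'):
    # same logic as the Python 2 version above, with bytes throughout;
    # like the original, it assumes every chunk contains at least one newline
    s = b'Lets say i have a text file of 1000 GB'
    with open(path, 'rb') as f:
        prev = b''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, b''):      # sentinel must be b'' in Python 3
            if not text.endswith(b'\n'):
                # keep the trailing partial line for the next iteration
                text, rest = text.rsplit(b'\n', 1)
                text = prev + text
                prev = rest
            else:
                text = prev + text
                prev = b''
            count += text.count(s)
        count += prev.count(s)
        print(count)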
