Processing Large Files in Python [1000 GB or More]
Question
Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.
Is there a faster way to do this than the one I am using below? How long would it take to complete the task?
phrase = "how fast it is"
count = 0
with open('bigfile.txt') as f:
    for line in f:
        count += line.count(phrase)
If I am right, when the file is not already in memory I would need to wait for the PC to load it each time I run the search, which should take at least 4000 seconds for a 250 MB/s hard drive and a 1000 GB file.
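The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the helper name `min_read_seconds` is made up for illustration, it assumes decimal units (1 GB = 1000 MB) and a drive that sustains its rated sequential throughput, and it ignores the time spent searching.

```python
def min_read_seconds(file_size_gb, throughput_mb_per_s):
    """Lower bound on the time needed to read the whole file once."""
    file_size_mb = file_size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    return file_size_mb / throughput_mb_per_s


seconds = min_read_seconds(1000, 250)
print(seconds, "seconds, about", round(seconds / 3600, 2), "hours")
```

For a 1000 GB file at 250 MB/s this gives 4000 seconds, a little over an hour per pass, which is a floor that no search code can beat on that drive.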
Recommended answer
I used file.read() to read the data in chunks. In the examples below the chunk sizes were 100 MB, 500 MB, 1 GB and 2 GB respectively. My text file is 2.1 GB.
Code:
from functools import partial

def read_in_chunks(size_in_bytes):
    # The file is read in binary mode, so bytes literals are used
    # throughout; this runs on both Python 2 and Python 3.
    s = b'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'rb') as f:
        prev = b''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, b''):
            if not text.endswith(b'\n'):
                # The chunk ends with a partial line: hold it back so it
                # is counted together with the start of the next chunk.
                # Note: this assumes every chunk contains at least one
                # newline; otherwise the unpacking below raises ValueError.
                text, rest = text.rsplit(b'\n', 1)
                # Prepend the previous partial line, if any.
                text = prev + text
                prev = rest
            else:
                # The chunk ends on a line boundary: simply prepend the
                # previous partial line.
                text = prev + text
                prev = b''
            count += text.count(s)
        # Count whatever is left after the last chunk.
        count += prev.count(s)
        print(count)
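For anyone adapting the approach, here is a hedged Python 3 sketch of the same idea that also copes with chunks containing no newline at all. It is not the code that was timed here: the name `count_phrase` and its parameters are mine, and it assumes the phrase is passed as bytes and contains no newline.

```python
from functools import partial


def count_phrase(path, phrase, chunk_size=100 * 1024 * 1024):
    """Count non-overlapping occurrences of `phrase` (bytes) in the file at `path`."""
    count = 0
    prev = b''  # trailing partial line carried over from the previous chunk
    with open(path, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b'' at end of file.
        for chunk in iter(partial(f.read, chunk_size), b''):
            text = prev + chunk
            # Everything before the last newline is complete lines; the
            # remainder is held back until the next chunk arrives. If the
            # chunk has no newline, all of it is held back.
            complete, _, prev = text.rpartition(b'\n')
            count += complete.count(phrase)
    # Whatever is left is the final line (the file may lack a trailing newline).
    count += prev.count(phrase)
    return count
```

It could be called as count_phrase('data.txt', b'Lets say i have a text file of 1000 GB', 1 << 30). Holding back the trailing partial line is what prevents a phrase that straddles two chunks from being missed.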
Timings:
read_in_chunks(104857600)
$ time python so.py
10000000
real 0m1.649s
user 0m0.977s
sys 0m0.669s
read_in_chunks(524288000)
$ time python so.py
10000000
real 0m1.558s
user 0m0.893s
sys 0m0.646s
read_in_chunks(1073741824)
$ time python so.py
10000000
real 0m1.242s
user 0m0.689s
sys 0m0.549s
read_in_chunks(2147483648)
$ time python so.py
10000000
real 0m0.844s
user 0m0.415s
sys 0m0.408s
On the other hand, the simple loop version takes around 6 seconds on my system:
def simple_loop():
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt') as f:
        # Sum the per-line counts; print() works on Python 2 and 3.
        print(sum(line.count(s) for line in f))
$ time python so.py
10000000
real 0m5.993s
user 0m5.679s
sys 0m0.313s
Results of @SlaterTyranus's grep version on my file:
$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt|wc -l
10000000
real 0m11.975s
user 0m11.779s
sys 0m0.568s
Results of @woot's solution:
$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m5.955s
user 0m14.825s
sys 0m5.766s
I got the best timing when I used 100 MB as the block size:
$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m4.632s
user 0m13.466s
sys 0m3.290s
Results of @woot's second solution:
$ time python woot_thread.py # CHUNK_SIZE = 1073741824
10000000
real 0m1.006s
user 0m0.509s
sys 0m2.171s
$ time python woot_thread.py #CHUNK_SIZE = 2147483648
10000000
real 0m1.009s
user 0m0.495s
sys 0m2.144s
System specs: Core i5-4670, 7200 RPM HDD