Read large file in parallel?

Question

I have a large file that I need to read in and build a dictionary from. I would like this to be as fast as possible, but my Python code is too slow. Here is a minimal example that shows the problem.

First, make some fake data:

paste <(seq 20000000) <(seq 2 20000001)  > largefile.txt

Now here is a minimal piece of Python code to read it in and make a dictionary:

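import sys
from collections import defaultdict

# Map each key (first column) to the list of its values (second column).
# Named `d` rather than `dict` so the built-in type is not shadowed.
d = defaultdict(list)

# The with-block closes the file once the loop finishes.
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split()
        d[parts[0]].append(parts[1])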

Timing:

time ./read.py largefile.txt
real    0m55.746s

However, it is possible to read the whole file much faster, for example:

time cut -f1 largefile.txt > /dev/null    
real    0m1.702s

My CPU has 8 cores. Is it possible to parallelize this program in Python to speed it up?

One possibility might be to read in a large chunk of the input, run 8 processes in parallel on different non-overlapping subchunks so that each builds a dictionary from the data in memory, and then read in the next large chunk. Is this possible in Python using multiprocessing somehow?

Update: the fake data was not very good, as it had only one value per key. Better is:

perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
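
For readers without Perl, a rough Python equivalent of the one-liner above might look like the following. It is only a sketch: the function name `write_fake_data`, its parameters, and the optional seed are illustrative assumptions and not part of the original post. Like the Perl version, it writes a random key below 1e7 and a random value below 1e4 on each line, separated by a single space.

```python
import random
import sys


def write_fake_data(path, n_lines=10_000_000, n_keys=10_000_000,
                    n_values=10_000, seed=None):
    """Write n_lines lines of '<key> <value>' with random integer fields.

    All names and defaults here are hypothetical; the defaults mirror
    the constants in the Perl one-liner.
    """
    rng = random.Random(seed)
    with open(path, "w") as out:
        for _ in range(n_lines):
            out.write(f"{rng.randrange(n_keys)} {rng.randrange(n_values)}\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    write_fake_data(sys.argv[1])
```

Because keys are drawn at random, some keys repeat, which gives the dictionary lists with more than one value.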

(Related to the question about reading in a large file and making a dictionary.)

Recommended answer

Several years ago there was a blog post series about this, the "Wide Finder Project", at Tim Bray's site [1]. There you can find a solution [2] by Fredrik Lundh, of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code.

[1] http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/
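
For orientation, the chunked idea from the question can be sketched with the standard `multiprocessing` module. This is not Fredrik Lundh's code, only a minimal illustration under some assumptions: the helper names `chunk_bounds`, `build_partial`, and `read_parallel` are made up for this sketch, each worker reads its own byte range of the file (aligned to line ends), and the parent merges the partial dictionaries. The partial results have to be pickled back to the parent process, so a speedup is not guaranteed and should be measured.

```python
import os
import sys
from collections import defaultdict
from multiprocessing import Pool


def chunk_bounds(path, n_chunks):
    """Return (start, end) byte ranges that each end at a line boundary."""
    size = os.path.getsize(path)
    bounds = []
    start = 0
    with open(path, "rb") as f:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            # Jump near the ideal split point, then finish the current line.
            f.seek(max(start, size * i // n_chunks))
            f.readline()
            end = f.tell()
            bounds.append((start, end))
            start = end
    return bounds


def build_partial(task):
    """Worker: build a dictionary from one byte range of the file."""
    path, start, end = task
    partial = defaultdict(list)
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(end - start).decode()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            partial[parts[0]].append(parts[1])
    return dict(partial)


def read_parallel(path, n_workers=8):
    """Merge the per-chunk dictionaries; imap keeps chunks in file order."""
    tasks = [(path, s, e) for s, e in chunk_bounds(path, n_workers)]
    merged = defaultdict(list)
    with Pool(n_workers) as pool:
        for partial in pool.imap(build_partial, tasks):
            for key, values in partial.items():
                merged[key].extend(values)
    return merged


if __name__ == "__main__" and len(sys.argv) > 1:
    result = read_parallel(sys.argv[1])
    print(len(result), "keys")
```

Each worker holds its whole chunk in memory at once, which keeps the sketch short; for very large files the number of chunks can be raised above the number of workers so that each chunk is smaller.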
