How to read / process large files in parallel with Python


Question

I have a large file of almost 20 GB, with more than 20 million lines, where each line is a separate serialized JSON object.

Reading the file line by line in a regular loop and performing manipulation on each line's data takes a lot of time.

Is there any state-of-the-art approach or best practice for reading a large file in parallel in smaller chunks, in order to make processing faster?

I'm using Python 3.6.x.
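For reference, a minimal sketch of the kind of single-threaded loop described above; the file name and the process_record helper are placeholders for illustration, not part of the original question:

import json

def process_record(record):
    # Placeholder for whatever per-line work is actually needed.
    return record

with open("big_file.jsonl", "r", encoding="utf-8") as fh:
    for line in fh:                   # streams the file; does not load 20 GB into memory
        record = json.loads(line)     # each line is one serialized JSON object
        process_record(record)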

Answer

Unfortunately, no. Reading a file and operating on the lines read (such as JSON parsing or computation) is a CPU-bound operation, so there is no clever asyncio tactic to speed it up. In theory you could use multiprocessing and multiple cores to read and process in parallel, but having multiple workers read the same file is bound to cause major problems. And because the file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.

Your best bet is to head this problem off at the pass by partitioning the data (if possible) into multiple files, which then opens a safer path to parallelism across multiple cores, as sketched below. Sorry there isn't a better answer, AFAIK.
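As a rough illustration of that suggestion, the sketch below first splits the big file into smaller line-aligned part files, then processes the parts in parallel with multiprocessing.Pool (available in Python 3.6). The file names, the lines-per-part value, and the process_file body are assumptions made for the example, not part of the original answer:

import json
import os
from multiprocessing import Pool

def split_file(path, lines_per_part=1000000, out_dir="parts"):
    # Write the big file out as smaller, line-aligned part files (still sequential).
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    buf, part = [], 0
    with open(path, "r", encoding="utf-8") as src:
        for line in src:
            buf.append(line)
            if len(buf) >= lines_per_part:
                part_paths.append(_flush(buf, out_dir, part))
                buf, part = [], part + 1
        if buf:
            part_paths.append(_flush(buf, out_dir, part))
    return part_paths

def _flush(buf, out_dir, part):
    part_path = os.path.join(out_dir, "part_%05d.jsonl" % part)
    with open(part_path, "w", encoding="utf-8") as dst:
        dst.writelines(buf)
    return part_path

def process_file(part_path):
    # Each worker process parses and handles one part file on its own.
    count = 0
    with open(part_path, "r", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # ... do the real per-record work here ...
            count += 1
    return count

if __name__ == "__main__":
    parts = split_file("big_file.jsonl")        # one-off partitioning step
    with Pool() as pool:                        # defaults to os.cpu_count() workers
        counts = pool.map(process_file, parts)  # each core works on its own file
    print("processed %d records" % sum(counts))

The one-time split is still sequential I/O, but after that each core reads and parses only its own file, which avoids the contention of multiple workers sharing one file handle.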

