What is the best way to load multiple files into memory in parallel using python 3.6?


Question

I have 6 large files, each of which contains a dictionary object that I saved to a hard disk with the pickle module. Loading all of them sequentially takes about 600 seconds. I want to start loading all of them at the same time to speed up the process. Assuming they are all roughly the same size, I hope to load them in 100 seconds instead. I used multiprocessing with apply_async to load each of them separately, but it runs as if it were sequential. This is the code I used, and it doesn't work. The code is for 3 of these files, but it would be the same for all six. I put the 3rd file on another hard disk to make sure IO is not the bottleneck.

def loadMaps():    
    start = timeit.default_timer()
    procs = []
    pool = Pool(3)
    pool.apply_async(load1(),)
    pool.apply_async(load2(),)
    pool.apply_async(load3(),)
    pool.close()
    pool.join()
    stop = timeit.default_timer()
    print('loadFiles takes in %.1f seconds' % (stop - start))

Answer

If your code is primarily limited by IO and the files are on multiple disks, you might be able to speed it up using threads:

import concurrent.futures
import pickle

def read_one(fname):
    # Read and unpickle a single file.
    with open(fname, 'rb') as f:
        return pickle.load(f)

def read_parallel(file_names):
    # Load the files concurrently; the results keep the order of file_names.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(read_one, f) for f in file_names]
        return [fut.result() for fut in futures]
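
As a usage sketch (the file names and the timing harness are illustrative, not part of the original answer; it assumes it lives in the same module as read_parallel above), the helper could be invoked like this:

import timeit

if __name__ == '__main__':
    # Hypothetical pickle paths; substitute the real six files here.
    files = ['map1.pkl', 'map2.pkl', 'map3.pkl']
    start = timeit.default_timer()
    maps = read_parallel(files)
    stop = timeit.default_timer()
    print('read_parallel loaded %d files in %.1f seconds' % (len(maps), stop - start))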

The GIL will not force IO operations to run serialized because Python consistently releases it when doing IO.

Several remarks on alternatives:

  • multiprocessing is unlikely to help because, while it guarantees to do its work in multiple processes (and therefore free of the GIL), it also requires the content to be transferred between the subprocesses and the main process, which takes additional time (see the process-based sketch after this list).

  • asyncio will not help you at all because it doesn't natively support asynchronous file system access (and neither do the popular OS'es). While it can emulate it with threads, the effect is the same as the code above, only with much more ceremony.
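
For comparison, here is a minimal process-based sketch (not part of the original answer; read_one is repeated so the snippet is self-contained) that does the same loading with concurrent.futures.ProcessPoolExecutor. Each worker unpickles its file, but the resulting dictionary is then pickled again just to be sent back to the main process, which is the extra transfer time mentioned above:

import concurrent.futures
import pickle

def read_one(fname):
    # Read and unpickle a single file inside a worker process.
    with open(fname, 'rb') as f:
        return pickle.load(f)

def read_parallel_processes(file_names):
    # Each returned dictionary is serialized by the worker and deserialized
    # again in the main process, so large objects pay that cost on top of the load.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return list(executor.map(read_one, file_names))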

Neither option will speed up loading the six files by a factor of six. Consider that at least some of the time is spent creating the dictionaries, which will be serialized by the GIL. If you really want to speed up startup, a better approach is not to create the whole dictionary upfront, but to switch to an in-file database, possibly using a dictionary to cache access to its content.
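
As one possible illustration of that last point (a sketch only, assuming the standard shelve module is an acceptable in-file store; the class name and file paths are hypothetical), the data can stay on disk and only the entries that are actually used get cached in a dictionary:

import shelve

class CachedMap:
    """Read-only, dict-like access backed by a shelve file, with an in-memory cache."""

    def __init__(self, path):
        self._db = shelve.open(path, flag='r')  # nothing is loaded upfront
        self._cache = {}

    def __getitem__(self, key):
        # shelve keys must be strings; a value is read from disk on first access only.
        if key not in self._cache:
            self._cache[key] = self._db[key]
        return self._cache[key]

    def close(self):
        self._db.close()

# One-time migration from an existing pickled dict (hypothetical paths):
#   import pickle
#   with open('map1.pkl', 'rb') as f, shelve.open('map1.db', flag='c') as db:
#       db.update(pickle.load(f))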
