Python Shared Memory Dictionary for Mapping Big Data

Problem description

I've been having a hard time using a large dictionary (~86GB, 1.75 billion keys) to process a big dataset (2TB) using multiprocessing in Python.

Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded, worker processes (ideally >32) are created that must look up values in the dictionary but not modify its contents, in order to process the ~2TB dataset. The dataset needs to be processed in parallel; otherwise the task would take over a month.

Here are the nine approaches (all of which failed) that I have tried:

1. Store the dictionary as a global variable in the Python program and then fork the ~32 worker processes. Theoretically this method might work: since the dictionary is not being modified, the copy-on-write (COW) mechanism of fork on Linux would mean that the data structure is shared among processes rather than copied (the pattern is sketched after this list). However, when I attempt this, my program crashes on os.fork() inside of multiprocessing.Pool.map with OSError: [Errno 12] Cannot allocate memory. I'm convinced that this is because the kernel is configured to never overcommit memory (/proc/sys/vm/overcommit_memory is set to 2, and I can't change this setting on the machine since I don't have root access).

2. Load the dictionary into a shared-memory dictionary with multiprocessing.Manager.dict. With this approach I was able to fork the 32 worker processes without crashing, but the subsequent data processing is orders of magnitude slower than another version of the task that required no dictionary (the only difference being the absence of the dictionary lookups). I theorize that this is because of the inter-process communication between the manager process holding the dictionary and each worker process, which is required for every single dictionary lookup (also sketched after this list). Although the dictionary is not being modified, it is being accessed many times, often simultaneously by many processes.

3. Copy the dictionary into a C++ std::map and rely on Linux's COW mechanism to prevent it from being copied (like approach #1, except with the dictionary in C++). With this approach, it took a long time to load the dictionary into the std::map, and the program subsequently crashed from ENOMEM on os.fork() just as before.

4. Copy the dictionary into pyshmht. It takes far too long to copy the dictionary into pyshmht.

5. Try using SNAP's HashTable. The underlying C++ implementation allows it to be created and used in shared memory. Unfortunately, the Python API does not offer this functionality.

6. Use PyPy. The crash from #1 still happened.

7. Implement my own shared-memory hash table in Python on top of multiprocessing.Array. This approach still resulted in the out-of-memory error that occurred in #1.

8. Dump the dictionary into dbm. After trying to dump the dictionary into a dbm database for four days and seeing an ETA of "33 days", I gave up on this approach.

9. Dump the dictionary into Redis. When I try to dump the dictionaries (the 86GB dict is loaded from 1024 smaller dicts) into Redis using redis.mset, I get a "connection reset by peer" error. When I try to dump the key-value pairs using a loop, it takes an extremely long time.

How can I process this dataset in parallel efficiently, without requiring inter-process communication to look up values in this dictionary? I would welcome any suggestions for solving this problem!

I'm using Python 3.6.3 from Anaconda on Ubuntu on a machine with 1TB RAM.

Edit: what ended up working:

I was able to get this to work using Redis. To get around the issue in #9, I had to chunk the large key-value insertion and lookup queries into "bite-sized" chunks, so that the data was still processed in batches but no single query was large enough to time out. Doing this allowed the insertion of the 86GB dictionary to be performed in 45 minutes (with 128 threads and some load balancing), and the subsequent processing was not hampered in performance by the Redis lookup queries (finished in 2 days).
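
A minimal sketch of that chunked insertion and lookup pattern with the redis-py client; the chunk size, host, and file name here are illustrative rather than the values actually used:

```python
# Chunked Redis insertion/lookup: keep every MSET/MGET small enough that the
# server never resets the connection or times out on a single huge command.
import pickle
from itertools import islice

import redis

def chunked_items(mapping, size):
    """Yield the mapping's items as dicts of at most `size` pairs."""
    it = iter(mapping.items())
    while True:
        batch = dict(islice(it, size))
        if not batch:
            break
        yield batch

r = redis.Redis(host="localhost", port=6379)

with open("big_dict.pkl", "rb") as f:       # hypothetical pickled dictionary
    big_dict = pickle.load(f)

for batch in chunked_items(big_dict, 10_000):
    r.mset(batch)                           # bite-sized batch insert

# Lookups during processing are batched the same way with MGET.
some_keys = list(islice(big_dict, 10_000))
values = r.mget(some_keys)
```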

Thank you all for your help and suggestions.

Recommended answer

You should probably use a system that's meant for sharing large amounts of data with many different processes, such as a database.

Take your giant dataset and create a schema for it and dump it into a database. You could even put it on a separate machine.

Then launch as many processes as you want, across as many hosts as you want, to process the data in parallel. Pretty much any modern database will be more than capable of handling the load.
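
As a concrete illustration of this suggestion, here is a hedged sketch using only the standard library's sqlite3; the table layout, file path, and toy data are assumptions. A client-server database (PostgreSQL, MySQL, etc.) would follow the same shape and also lets workers run on other hosts:

```python
# Load the key-value mapping into an indexed table once, then let each worker
# process open its own read-only connection and do point lookups with SQL.
import sqlite3

DB_PATH = "mapping.db"  # hypothetical location of the database file

def build_db(pairs):
    """Create and populate the table from an iterable of (key, value) pairs."""
    con = sqlite3.connect(DB_PATH)
    con.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    con.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", pairs)
    con.commit()
    con.close()

def lookup(con, key):
    """Indexed point lookup; each worker keeps its own connection."""
    row = con.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    build_db([("hello", "world"), ("foo", "bar")])            # toy data
    con = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    print(lookup(con, "hello"))                               # -> world
```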
