Sharing a complex python object in memory between separate processes
Problem description
I have a complex Python object, of size ~36GB in memory, which I would like to share between multiple separate Python processes. It is stored on disk as a pickle file, which I currently load separately for every process. I want to share this object so that more processes can execute in parallel within the amount of memory available.
This object is used, in a sense, as a read-only database. Every process initiates multiple access requests per second, and every request is for just a small portion of the data.
I looked into solutions like Redis, but I saw that eventually the data needs to be serialized into a simple textual form. Also, mapping the pickle file itself into memory should not help, because every process would still need to unpickle it. So I thought about two other possible solutions:
- Using shared memory, where every process can access the address at which the object is stored. The problem here is that a process would only see a bulk of bytes, which it cannot interpret.
- Writing code that holds this object and manages retrieval of the data through API calls. Here, I wonder how such a solution performs in terms of speed.
Is there a simple way to implement either of these solutions? Perhaps there is a better solution for this situation?
Thanks a lot!
Answer
For complex objects there isn't a readily available method to share memory directly between processes. If you have simple ctypes you can do this in C-style shared memory, but it won't map directly to Python objects.
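As a minimal sketch of that C-style route (my example, not from the answer): `multiprocessing.Array` allocates a flat ctypes buffer that child processes can read and write in place, but it only holds raw numbers, not a structured Python object.

```python
from multiprocessing import Process, Array

def double_all(buf):
    # The child process writes into the same shared buffer the parent allocated
    for i in range(len(buf)):
        buf[i] = buf[i] * 2.0

if __name__ == '__main__':
    buf = Array('d', [1.0, 2.0, 3.0])  # 'd' = C double; lives in shared memory
    p = Process(target=double_all, args=(buf,))
    p.start()
    p.join()
    print(list(buf))  # [2.0, 4.0, 6.0]
```

This works because both processes see the same underlying buffer, but recovering a 36GB nested Python object from such a buffer would require writing your own serialization layout on top of it.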
There is a simple solution that works well if you only need a portion of your data at any one time, not the entire 36GB. For this you can use a SyncManager from multiprocessing.managers. Using this, you set up a server that serves a proxy class for your data (your data isn't stored in the class; the proxy only provides access to it). Your client then attaches to the server using a BaseManager and calls methods in the proxy class to retrieve the data.
Behind the scenes, the Manager classes take care of pickling the data you ask for and sending it through the open port from server to client. Because you're pickling data with every call, this isn't efficient if you need your entire dataset. In the case where you only need a small portion of the data in the client, the method saves a lot of time, since the data only needs to be loaded once by the server.
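Because each proxy call costs one pickle/unpickle plus a round trip over the socket, it can pay to batch lookups. A hypothetical batch method (my addition; `getVectors` is not part of the answer's code) could return many vectors in one call:

```python
gVectors = {}  # stands in for the server's global store

class GloVeProxy(object):
    # Hypothetical batch lookup (an assumption, not in the original answer):
    # one call pickles a single dict instead of one vector per round trip.
    def getVectors(self, words, default=None):
        global gVectors
        return {w: gVectors.get(w, default) for w in words}
```

Registered on the SyncManager like the other methods, a client could then fetch a whole batch of words with a single proxy call.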
The solution is comparable to a database solution speed-wise, but it can save you a lot of complexity and DB learning if you'd prefer to keep to a purely Pythonic solution.
Here's some example code that is meant to work with GloVe word vectors.
Server
#!/usr/bin/python
from multiprocessing.managers import SyncManager
import numpy

# Global for storing the data to be served
gVectors = {}

# Proxy class to be shared with different processes.
# Don't put the big vector data in here since that would force it to
# be piped to the other process when instantiated there; instead just
# return the global vector data, from this process, when requested.
class GloVeProxy(object):
    def __init__(self):
        pass

    def getNVectors(self):
        global gVectors
        return len(gVectors)

    def getEmpty(self):
        global gVectors
        return numpy.zeros_like(next(iter(gVectors.values())))

    def getVector(self, word, default=None):
        global gVectors
        return gVectors.get(word, default)

# Class to encapsulate the server functionality
class GloVeServer(object):
    def __init__(self, port, fname):
        self.port = port
        self.load(fname)

    # Load the vectors into gVectors (global)
    @staticmethod
    def load(filename):
        global gVectors
        with open(filename, 'r') as f:
            for line in f:
                vals = line.rstrip().split(' ')
                gVectors[vals[0]] = numpy.array(vals[1:]).astype('float32')

    # Run the server
    def run(self):
        class myManager(SyncManager):
            pass

        myManager.register('GloVeProxy', GloVeProxy)
        mgr = myManager(address=('', self.port), authkey=b'GloVeProxy01')
        server = mgr.get_server()
        server.serve_forever()

if __name__ == '__main__':
    port = 5010
    fname = '/mnt/raid/Data/Misc/GloVe/WikiGiga/glove.6B.50d.txt'
    print('Loading vector data')
    gs = GloVeServer(port, fname)
    print('Serving data. Press <ctrl>-c to stop.')
    gs.run()
Client
from multiprocessing.managers import BaseManager
import psutil  # 3rd party module for process info (not strictly required)

# Grab the shared proxy class. All methods in that class will be available here
class GloVeClient(object):
    def __init__(self, port):
        assert self._checkForProcess('GloVeServer.py'), 'Must have GloVeServer running'
        class myManager(BaseManager):
            pass

        myManager.register('GloVeProxy')
        self.mgr = myManager(address=('localhost', port), authkey=b'GloVeProxy01')
        self.mgr.connect()
        self.glove = self.mgr.GloVeProxy()

    # Return the instance of the proxy class
    @staticmethod
    def getGloVe(port):
        return GloVeClient(port).glove

    # Verify the server is running
    @staticmethod
    def _checkForProcess(name):
        for proc in psutil.process_iter():
            if proc.name() == name:
                return True
        return False

if __name__ == '__main__':
    port = 5010
    glove = GloVeClient.getGloVe(port)
    for word in ['test', 'cat', '123456']:
        print('%s = %s' % (word, glove.getVector(word)))
Note that the psutil library is just used to check whether you have the server running; it's not required. Be sure to name the server GloVeServer.py, or change the psutil check in the code so it looks for the correct name.