Sharing a complex Python object in memory between separate processes

Problem Description

I have a complex Python object, ~36 GB in memory, which I would like to share between multiple separate Python processes. It is stored on disk as a pickle file, which I currently load separately for every process. I want to share this object to enable the execution of more processes in parallel, within the amount of memory available.

This object is used, in a sense, as a read-only database. Every process issues multiple access requests per second, and every request is for just a small portion of the data.

I looked into solutions like Redis, but I saw that eventually the data needs to be serialized into a simple textual form. Also, mapping the pickle file itself into memory should not help, because it would still need to be unpickled by every process. So I thought about two other possible solutions:

  1. Using shared memory, where every process can access the address at which the object is stored. The problem here is that the process would only see a bunch of bytes, which it cannot interpret (see the sketch after this list).
  2. Writing code that holds this object and manages retrieval of the data through API calls. Here, I wonder how well such a solution would perform in terms of speed.
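
To illustrate the problem with option 1, here is a minimal sketch, assuming Python 3.8+ and its multiprocessing.shared_memory module (my own illustration, not part of the original question). The shared block holds raw bytes only, so every reader still has to unpickle the entire object, which duplicates the memory per process anyway:

import pickle
from multiprocessing import shared_memory

data = {'cat': [0.1, 0.2], 'dog': [0.3, 0.4]}   # stand-in for the large object
payload = pickle.dumps(data)

# Producer: copy the pickled bytes into a named shared-memory block
shm = shared_memory.SharedMemory(create=True, size=len(payload), name='objstore')
shm.buf[:len(payload)] = payload

# Consumer (normally another process, attaching by name): the buffer is
# just bytes, so getting the object back means unpickling all of it,
# which duplicates the memory in every process anyway
view = shared_memory.SharedMemory(name='objstore')
obj = pickle.loads(bytes(view.buf[:len(payload)]))
print(obj['cat'])

view.close()
shm.close()
shm.unlink()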

Is there a simple way to implement either of these solutions? Perhaps there is a better solution for this situation?

Thanks a lot!

Solution

For complex objects there isn't a readily available method to share memory directly between processes. If you have simple ctypes you can do this in C-style shared memory, but it won't map directly to Python objects.
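
For instance, here is a minimal sketch of that C-style sharing (my own illustration, not part of the original answer), using multiprocessing.Value and multiprocessing.Array, which wrap ctypes objects in shared memory:

from multiprocessing import Process, Value, Array

def worker(counter, vec):
    # Both arguments live in shared memory, so writes are visible
    # to the parent process without any pickling
    with counter.get_lock():
        counter.value += 1
    vec[0] = 42.0

if __name__ == '__main__':
    counter = Value('i', 0)          # shared C int
    vec = Array('f', [0.0] * 50)     # shared C float array
    p = Process(target=worker, args=(counter, vec))
    p.start()
    p.join()
    print(counter.value, vec[0])

This works for flat numeric data, but a nested structure like a dict of arrays has no such direct ctypes mapping.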

There is a simple solution that works well if you only need a portion of your data at any one time, rather than the entire 36 GB. For this you can use a SyncManager from multiprocessing.managers. With it, you set up a server that serves a proxy class for your data (the data itself isn't stored in the class; the proxy only provides access to it). Your client then attaches to the server using a BaseManager and calls methods on the proxy class to retrieve the data.

Behind the scenes, the Manager classes take care of pickling the data you ask for and sending it through the open port from server to client. Because you're pickling data with every call, this isn't efficient if you need your entire dataset. But in the case where you only need a small portion of the data in the client, the method saves a lot of time, since the data only needs to be loaded once, by the server.
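
If the per-call round trips ever become a bottleneck, one mitigation (my own suggestion, not part of the original answer) is to batch several lookups into a single proxy call, so the pickling and transport overhead is paid once per batch rather than once per word. A hypothetical method that could be added to the GloVeProxy class shown below:

    # Hypothetical batch lookup: one round trip (and one pickling pass)
    # for many words instead of one call per word
    def getVectors(self, words, default=None):
        global gVectors
        return {w: gVectors.get(w, default) for w in words}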

This solution is comparable to a database solution speed-wise, but it can save you a lot of complexity and DB learning if you'd prefer to keep to a purely Pythonic solution.

Here's some example code that is meant to work with GloVe word vectors.

Server

#!/usr/bin/env python3
from multiprocessing.managers import SyncManager
import numpy

# Global for storing the data to be served
gVectors = {}

# Proxy class to be shared with different processes.
# Don't put the big vector data in here, since that would force it to
# be piped to the other process when instantiated there; instead just
# return the global vector data, from this process, when requested.
class GloVeProxy(object):
    def __init__(self):
        pass

    def getNVectors(self):
        global gVectors
        return len(gVectors)

    def getEmpty(self):
        global gVectors
        return numpy.zeros_like(next(iter(gVectors.values())))

    def getVector(self, word, default=None):
        global gVectors
        return gVectors.get(word, default)

# Class to encapsulate the server functionality
class GloVeServer(object):
    def __init__(self, port, fname):
        self.port = port
        self.load(fname)

    # Load the vectors into gVectors (global)
    @staticmethod
    def load(filename):
        global gVectors
        with open(filename, 'r') as f:
            for line in f:
                vals = line.rstrip().split(' ')
                gVectors[vals[0]] = numpy.array(vals[1:]).astype('float32')

    # Run the server
    def run(self):
        class myManager(SyncManager): pass
        myManager.register('GloVeProxy', GloVeProxy)
        mgr = myManager(address=('', self.port), authkey=b'GloVeProxy01')
        server = mgr.get_server()
        server.serve_forever()

if __name__ == '__main__':
    port  = 5010
    fname = '/mnt/raid/Data/Misc/GloVe/WikiGiga/glove.6B.50d.txt'

    print('Loading vector data')
    gs = GloVeServer(port, fname)

    print('Serving data. Press <ctrl>-c to stop.')
    gs.run()

Client

from multiprocessing.managers import BaseManager
import psutil   # 3rd-party module for process info (not strictly required)

# Grab the shared proxy class.  All methods in that class will be available here
class GloVeClient(object):
    def __init__(self, port):
        assert self._checkForProcess('GloVeServer.py'), 'Must have GloVeServer running'
        class myManager(BaseManager): pass
        myManager.register('GloVeProxy')
        self.mgr = myManager(address=('localhost', port), authkey=b'GloVeProxy01')
        self.mgr.connect()
        self.glove = self.mgr.GloVeProxy()

    # Return the instance of the proxy class
    @staticmethod
    def getGloVe(port):
        return GloVeClient(port).glove

    # Verify the server is running
    @staticmethod
    def _checkForProcess(name):
        for proc in psutil.process_iter():
            if proc.name() == name:
                return True
        return False

if __name__ == '__main__':
    port = 5010
    glove = GloVeClient.getGloVe(port)

    for word in ['test', 'cat', '123456']:
        print('%s = %s' % (word, glove.getVector(word)))

Note that the psutil library is only used to check whether the server is running; it is not strictly required. Be sure to name the server script GloVeServer.py, or change the psutil check in the code so that it looks for the correct name.
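
As a closing usage sketch (my own addition, not from the original answer): assuming the server above is already running on port 5010 and the client code is saved as glove_client.py, each worker process can open its own connection and query the single shared copy of the data, which is the parallel execution the question is after:

from multiprocessing import Pool

from glove_client import GloVeClient   # assuming the client code above lives here

def lookup(word):
    # Each worker attaches to the one running server, so the ~36GB
    # object is loaded only once, by the server process
    glove = GloVeClient.getGloVe(5010)
    return word, glove.getVector(word)

if __name__ == '__main__':
    with Pool(4) as pool:
        for word, vec in pool.map(lookup, ['test', 'cat', 'dog', '123456']):
            print(word, vec)

Opening one connection per lookup keeps the sketch simple; a real driver would reuse a connection per worker.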
