Shared-memory complex writable data structures


Question

I have an algorithm that operates on a large graph structure, and I'd like to make it multithreaded for better performance. None of the methods I've looked at quite fit what I want: I would like the graph to exist in shared memory that all of the processes can read and write to (using locks to prevent race conditions). Essentially, I would like something that behaves like OpenMP in C, where all the memory is accessible by each thread.

I started by looking at the threading module, but the GIL means that the performance increase is insignificant.

I proceeded to try the multiprocessing module, as suggested by most of the posts I've found on this topic (e.g. "How can I share a dictionary across multiple processes?" and "Shared-memory objects in python multiprocessing"). There are two main problems with this.

First, it seems as though multiprocessing doesn't work well with complicated objects. Consider the following toy problem: I have a list of integers and would like to multiply all of them by 10, then output all the numbers in arbitrary order. I can use the following code:

from multiprocessing import Manager, Process

def multiply_list():
    manager = Manager()
    output = manager.list()  # list proxy served by the manager process

    for v in range(10):
        output.append(v)
    print([str(v) for v in output])

    def process(inputs, start, end):
        while start < end:
            inputs[start] *= 10
            start += 1

    t1 = Process(target=process,
        args = (output, 0, 5))
    t2 = Process(target=process,
        args = (output, 5, 10))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    print([str(v) for v in output])

Output:

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
['0', '10', '20', '30', '40', '50', '60', '70', '80', '90']

However, if I instead have a list of objects, and modify the objects:

class Container(object):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return "C" + str(self.value)

def multiply_containers():
    manager = Manager()
    output = manager.list()

    for v in range(10):
        output.append(Container(v))
    print([str(v) for v in output])

    def process(inputs, start, end):
        while start < end:
            inputs[start].value *= 10
            start += 1

    t1 = Process(target=process,
        args = (output, 0, 5))
    t2 = Process(target=process,
        args = (output, 5, 10))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    print([str(v) for v in output])

There is no change:

['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9']
['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9']

Another issue is that the SO post I linked suggested that trying to write to the data structure would make a copy of it, which I don't want.

To clarify the algorithm itself, the first step (building up the graph) works something like this: I have a list of sentences, which are sequences of words. I would like to build a directed graph where each vertex is a word, with out-edges going to each word that follows it in some sentence. For example, if my input is "the cat in the hat" and "the cat in the house", my output graph would be the => cat => in => the => hat, house (that is, "the" has two out-edges, one to "hat" and one to "house"). I also keep track of some auxiliary information, such as how common each sentence or word is. Each vertex has a list of in- and out-edges and some attributes.
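
The graph-building step described above can be sketched in a single process as follows. This is only an illustration under stated assumptions: sentences are taken to be whitespace-separated strings, each distinct word is one vertex, and the names `build_word_graph`, `out_edges`, `in_edges`, and `word_counts` are hypothetical rather than taken from the actual code.

```python
from collections import Counter, defaultdict


def build_word_graph(sentences):
    """Build out-edges, in-edges, and word frequencies from sentences.

    Assumes each sentence is a string of whitespace-separated words.
    """
    out_edges = defaultdict(set)  # word -> words that follow it somewhere
    in_edges = defaultdict(set)   # word -> words that precede it somewhere
    word_counts = Counter()       # auxiliary info: how common each word is

    for sentence in sentences:
        words = sentence.split()
        word_counts.update(words)
        # Pair every word with its immediate successor in the sentence.
        for current, following in zip(words, words[1:]):
            out_edges[current].add(following)
            in_edges[following].add(current)

    return out_edges, in_edges, word_counts


sentences = ["the cat in the hat", "the cat in the house"]
out_edges, in_edges, word_counts = build_word_graph(sentences)
print(sorted(out_edges["the"]))  # ['cat', 'hat', 'house']
print(word_counts["the"])        # 4
```

Because this sketch merges both occurrences of "the" into one vertex, "the" ends up with an out-edge to "cat" as well as to "hat" and "house".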

I found a module that might work (http://poshmodule.sourceforge.net/posh/html/) but I'm not sure if there's a "canonical" or recommended way to do this sort of thing.

Thanks!

Answer

Here's working sample code that uses a separate Manager process to control access to the shared data structure. It is based on your example code plus the code in the question "Sharing object (class instance) in python using Managers", which @freakish suggested in a comment might be a duplicate. It's not clear to me whether it is, but the overall approach seems like it might solve your problem.

from multiprocessing import Process
from multiprocessing.managers import BaseManager

class Container(object):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return "C" + str(self.value)
    def multiply(self, factor):  # added method
        self.value *= factor

def process(inputs, start, end):
    for i in range(start, end):
        inputs.apply(i, 'multiply', (10,))

class ListProxy(object):
    def __init__(self):
        self.nl = []
    def append(self, x):
        self.nl.append(x)
    def __getitem__(self, key):
        return self.nl[key]
    def __iter__(self):
        return iter(self.nl)
    def apply(self, i, method, args, **kwargs):
        # Runs inside the manager process, so the stored object itself is mutated.
        getattr(self.nl[i], method)(*args, **kwargs)

class ListManager(BaseManager):
    pass

ListManager.register('ListProxy', ListProxy,
                     exposed=['append', '__getitem__', '__iter__', 'apply'])

def main():
    manager = ListManager()
    manager.start()
    output = manager.ListProxy()

    for v in range(10):
        output.append(Container(v))
    print([str(v) for v in output])

    t1 = Process(target=process, args=(output, 0, 5))
    t2 = Process(target=process, args=(output, 5, 10))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

    print([str(v) for v in output])

if __name__ == '__main__':
    main()
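
As background on why `multiply_containers` in the question shows no change: indexing a `manager.list()` proxy hands back a pickled copy of the stored object, so `inputs[start].value *= 10` only modifies that copy. The Python documentation for manager proxies notes that such in-place changes are not propagated, whereas assigning a value back into the proxy (which calls `__setitem__`) is. Below is a minimal sketch of that write-back pattern as a lighter alternative; it is not part of the original answer, and the helper names `scale` and `run_demo` are made up for illustration.

```python
from multiprocessing import Manager, Process


class Container(object):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return "C" + str(self.value)


def scale(shared, start, end, factor):
    # Hypothetical worker: read a copy, modify it, then store it back.
    for i in range(start, end):
        item = shared[i]      # copy fetched from the manager process
        item.value *= factor  # changes only the local copy
        shared[i] = item      # __setitem__ sends the updated copy back


def run_demo():
    with Manager() as manager:
        shared = manager.list([Container(v) for v in range(4)])
        workers = [
            Process(target=scale, args=(shared, 0, 2, 10)),
            Process(target=scale, args=(shared, 2, 4, 10)),
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        # Read the results while the manager process is still alive.
        return [str(c) for c in shared]


if __name__ == "__main__":
    print(run_demo())  # ['C0', 'C10', 'C20', 'C30']
```

Each worker here touches a disjoint index range, so no lock is used; if two processes could update the same element, the read-modify-write sequence would need a `Lock` around it. Every access is still a round trip to the manager process, so this is not true shared memory and may be slow for a large graph.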
