Possible to share in-memory data between 2 separate processes?


Problem description


I have an xmlrpc server using Twisted. The server has a huge amount of data stored in-memory. Is it possible to have a secondary, separate xmlrpc server running which can access the object in-memory in the first server?

So, serverA starts up and creates an object. serverB starts up and can read from the object in serverA.

* EDIT *

The data to be shared is a list of 1 million tuples.

Solution

Without some deep and dark rewriting of the Python core runtime (to allow forcing of an allocator that uses a given segment of shared memory and ensures compatible addresses between disparate processes) there is no way to "share objects in memory" in any general sense. That list will hold a million addresses of tuples, each tuple made up of addresses of all of its items, and each of these addresses will have been assigned by pymalloc in a way that inevitably varies among processes and spreads all over the heap.

On just about every system except Windows, it's possible to spawn a subprocess that has essentially read-only access to objects in the parent process's space... as long as the parent process doesn't alter those objects, either. That's obtained with a call to os.fork(), that in practice "snapshots" all of the memory space of the current process and starts another simultaneous process on the copy/snapshot. On all modern operating systems, this is actually very fast thanks to a "copy on write" approach: the pages of virtual memory that are not altered by either process after the fork are not really copied (access to the same pages is instead shared); as soon as either process modifies any bit in a previously shared page, poof, that page is copied, and the page table modified, so the modifying process now has its own copy while the other process still sees the original one.
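For what it's worth, here is a minimal sketch of that fork-based, read-only form of sharing (POSIX only, since os.fork() is unavailable on Windows); the data and variable names are purely illustrative:

    import os

    big_list = [(i, i * 2.0) for i in range(1_000_000)]  # built once in the parent

    pid = os.fork()
    if pid == 0:
        # Child: sees the parent's data through copy-on-write pages.
        # Even "reads" touch reference counts, which dirties pages and
        # gradually forces real copies, as noted below.
        print("child sees", len(big_list), "tuples; first one:", big_list[0])
        os._exit(0)
    else:
        os.waitpid(pid, 0)
        # The parent's copy is unaffected by anything the child did.
        print("parent still has", len(big_list), "tuples")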

This extremely limited form of sharing can still be a lifesaver in some cases (although it's extremely limited: remember for example that adding a reference to a shared object counts as "altering" that object, due to reference counts, and so will force a page copy!)... except on Windows, of course, where it's not available. With this single exception (which I don't think will cover your use case), sharing of object graphs that include references/pointers to other objects is basically unfeasible -- and just about any set of objects of interest in modern languages (including Python) falls under this classification.

In extreme (but sufficiently simple) cases one can obtain sharing by renouncing the native memory representation of such object graphs. For example, a list of a million tuples each with sixteen floats could actually be represented as a single block of 128 MB of shared memory -- all the 16M floats in double-precision IEEE representation laid end to end -- with a little shim on top to "make it look like" you're addressing things in the normal way (and, of course, the not-so-little-after-all shim would also have to take care of the extremely hairy inter-process synchronization problems that are certain to arise;-). It only gets hairier and more complicated from there.
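As a rough illustration of that "flat block plus shim" idea, the sketch below packs the tuples into one shared byte buffer using the standard library's multiprocessing.shared_memory module (Python 3.8+, which postdates this answer); the TupleView class and all names are assumptions for illustration only, and the hairy inter-process synchronization issues are entirely ignored:

    import struct
    from multiprocessing import shared_memory

    N_TUPLES = 1_000_000
    ITEM = struct.Struct("=16d")   # one tuple = 16 IEEE doubles = 128 bytes

    class TupleView:
        """Make a raw byte buffer 'look like' a read-only list of 16-float tuples."""
        def __init__(self, buf):
            self._buf = buf
        def __len__(self):
            return N_TUPLES
        def __getitem__(self, i):
            return ITEM.unpack_from(self._buf, i * ITEM.size)

    # Creating process: allocate the ~128 MB block and fill it.
    shm = shared_memory.SharedMemory(create=True, size=N_TUPLES * ITEM.size)
    ITEM.pack_into(shm.buf, 0, *range(16))     # write tuple 0 as an example

    # Another process would attach to the same block by name instead:
    #   shm = shared_memory.SharedMemory(name=shm.name)
    view = TupleView(shm.buf)
    print(view[0])                             # (0.0, 1.0, ..., 15.0)

    shm.close()
    shm.unlink()                               # only the owning process unlinks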

Modern approaches to concurrency are more and more disdaining shared-anything approaches in favor of shared-nothing ones, where tasks communicate by message passing (even in multi-core systems using threading and shared address spaces, the synchronization issues and the performance hits the HW incurs in terms of caching, pipeline stalls, etc, when large areas of memory are actively modified by multiple cores at once, are pushing people away).

For example, the multiprocessing module in Python's standard library relies mostly on pickling and sending objects back and forth, not on sharing memory (surely not in a R/W way!-).
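A tiny sketch of that message-passing style (names and sizes are made up for illustration): a chunk of the list is pickled into a queue, the worker gets its own copy, and only a small result travels back.

    from multiprocessing import Process, Queue

    def worker(inq, outq):
        chunk = inq.get()                        # arrives as a pickled copy, not shared
        outq.put(sum(t[0] for t in chunk))       # only the small result goes back

    if __name__ == "__main__":
        inq, outq = Queue(), Queue()
        p = Process(target=worker, args=(inq, outq))
        p.start()
        inq.put([(float(i), i * 2.0) for i in range(10_000)])   # pickled and sent
        print("partial sum from worker:", outq.get())
        p.join()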

I realize this is not welcome news to the OP, but if he does need to put multiple processors to work, he'd better think in terms of having anything they must share reside in places where it can be accessed and modified via message passing -- a database, a memcache cluster, a dedicated process that does nothing but keep those data in memory and send and receive them on request, and other such message-passing-centric architectures.
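For the "dedicated process that does nothing but hold the data" option, one possible standard-library sketch uses multiprocessing.managers.BaseManager to expose the list over a local socket, so that processes in the serverA/serverB role talk to it purely by message passing; the address, authkey, class and method names below are all illustrative assumptions, not part of the original answer:

    from multiprocessing.managers import BaseManager

    class DataStore:
        """Owns the big list; other processes only see proxies and messages."""
        def __init__(self):
            self._tuples = [(float(i),) * 16 for i in range(1000)]  # stand-in data
        def get(self, i):
            return self._tuples[i]
        def size(self):
            return len(self._tuples)

    class StoreManager(BaseManager):
        pass

    if __name__ == "__main__":
        store = DataStore()
        StoreManager.register("store", callable=lambda: store)
        mgr = StoreManager(address=("127.0.0.1", 50000), authkey=b"secret")
        mgr.get_server().serve_forever()

    # A client process (say, serverB) would connect along these lines:
    #   StoreManager.register("store")
    #   mgr = StoreManager(address=("127.0.0.1", 50000), authkey=b"secret")
    #   mgr.connect()
    #   proxy = mgr.store()
    #   print(proxy.size(), proxy.get(0))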

