IPC shared memory across Python scripts in separate Docker containers

Question

I have written a neural network classifier that takes in massive images (~1-3 GB apiece), patches them up, and passes the patches through the network individually. Training was going really slowly, so I benchmarked it and found that it was taking ~50s to load the patches from one image into memory (using the Openslide library), and only ~.5 s to pass them through the model.

However, I'm working on a supercomputer with 1.5Tb of RAM of which only ~26 Gb is being utilized. The dataset is a total of ~500Gb. My thinking is that if we could load the entire dataset into memory it would speed up training tremendously. But I am working with a research team and we are running experiments across multiple Python scripts. So ideally, I would like to load the entire dataset into memory in one script and be able to access it across all scripts.
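
For context, the patch loading being benchmarked is roughly of this shape - a minimal sketch using OpenSlide's read_region, where the patch size is a placeholder and the level and coordinates mirror the ones used in the scripts further down:

import openslide

slide = openslide.OpenSlide('path/to/normal_042.tif')
patches = []
for (x, y) in [(14336, 10752), (9408, 18368), (8064, 25536), (16128, 14336)]:
    # read_region takes level-0 coordinates and returns a PIL RGBA image
    patches.append(slide.read_region((x, y), 2, (256, 256)))
slide.close()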

More details:

  • We run our individual experiments in separate Docker containers (on the same machine), so the dataset has to be accessible across multiple containers.
  • The dataset is the Camelyon16 Dataset; images are stored in .tif format.
  • We just need to read the images, no need to write.
  • We only need to access small portions of the dataset at a time.

I have found many posts about how to share Python objects or raw data in memory across multiple Python scripts:

Server Processes with SyncManager and BaseManager in the multiprocessing module | Example 1 | Example 2 | Docs - Server Processes | Docs - SyncManagers

  • Positives: Can be shared by processes on different computers over a network (can it be shared by multiple containers?)
  • Possible issue: slower than using shared memory, according to the docs. If we share memory across multiple containers using a client/server, will that be any faster than all of the scripts reading from disk?
  • Possible issue: according to this answer, the Manager object pickles objects before sending them, which could slow things down.

The mmap module | Docs

  • Possible issue: mmap maps the file to virtual memory, not physical memory - it creates a temporary file (see the sketch after this list).
  • Possible issue: because we use only a small portion of the dataset at a time, and virtual memory keeps the entire dataset on disk, we run into thrashing issues and the program slows to a crawl.
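
For reference, memory-mapping one of the image files read-only looks roughly like this (a minimal sketch; the path is a placeholder):

import mmap

with open('path/to/normal_042.tif', 'rb') as f:
    # Map the whole file read-only; pages are only brought into RAM when touched
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:8]  # slicing reads bytes out of the mapping
    mm.close()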

Pyro4 (client-server for Python objects) | Docs
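
A Pyro4-based data server would look roughly like this - a sketch only, with a hypothetical class, object id, and port, and with the caveat that, like the Manager approach, Pyro4 serializes return values over the connection:

# pyro_server.py
import Pyro4

@Pyro4.expose
class PatchStore(object):
    def __init__(self):
        self.patches = {'image_0': 1}  # placeholder for real patch data
    def get(self, key):
        return self.patches[key]

daemon = Pyro4.Daemon(host='0.0.0.0', port=9090)
daemon.register(PatchStore(), objectId='patchstore')
daemon.requestLoop()

# pyro_client.py
import Pyro4
store = Pyro4.Proxy('PYRO:patchstore@<server-address>:9090')
print(store.get('image_0'))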

The sysv_ipc module for Python. This demo looks promising.

  • Possible issue: possibly just a lower-level exposure of what the built-in multiprocessing module already provides? (A basic usage sketch follows.)
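
For what it's worth, basic sysv_ipc usage (as in the linked demo) is just a writer and a reader agreeing on a key; System V segments live in the kernel's IPC namespace, which is exactly what docker's --ipc option shares. A minimal sketch, with the key and sizes made up:

# writer.py
import sysv_ipc

KEY = 424242  # arbitrary key both sides agree on
mem = sysv_ipc.SharedMemory(KEY, sysv_ipc.IPC_CREAT, size=1024 * 1024)
mem.write(b'hello from the data server')
# The segment persists until explicitly removed with mem.remove()

# reader.py
import sysv_ipc

mem = sysv_ipc.SharedMemory(424242)  # attach to the existing segment
print(mem.read(26))                  # b'hello from the data server'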

I also found this list of options for IPC/networking in Python.

Some discuss server-client setups, some discuss serialization/deserialization, which I'm afraid will take longer than just reading from disk. None of the answers I've found address my question about whether these will result in a performance improvement on I/O.

Not only do we need to share Python objects/memory across scripts; we need to share them across Docker containers.

The Docker documentation explains the --ipc flag pretty well. What makes sense to me according to the documentation is running:

docker run -d --ipc=shareable data-server
docker run -d --ipc=container:data-server data-client

But when I run my client and server in separate containers with an --ipc connection set up as described above, they are unable to communicate with each other. The SO questions I've read (1, 2, 3, 4) don't address integration of shared memory between Python scripts in separate Docker containers.

  • 1: Would any of these provide faster access than reading from disk? Is it even reasonable to think that sharing data in memory across processes/containers would improve performance?
  • 2: Which would be the most appropriate solution for sharing data in memory across multiple Docker containers?
  • 3: How to integrate memory-sharing solutions from Python with docker run --ipc=<mode>? (is a shared IPC namespace even the best way to share memory across docker containers?)
  • 4: Is there a better solution than these to fix our problem of large I/O overhead?

This is my naive approach to memory sharing between Python scripts in separate containers. It works when the Python scripts are run in the same container, but not when they are run in separate containers.

server.py

from multiprocessing.managers import SyncManager
import multiprocessing

patch_dict = {}

image_level = 2
image_files = ['path/to/normal_042.tif']
region_list = [(14336, 10752),
               (9408, 18368),
               (8064, 25536),
               (16128, 14336)]

def load_patch_dict():

    for i, image_file in enumerate(image_files):
        # We would load the image files here. As a placeholder, we just add `1` to the dict
        patches = 1
        patch_dict.update({'image_{}'.format(i): patches})

def get_patch_dict():
    return patch_dict

class MyManager(SyncManager):
    pass

if __name__ == "__main__":
    load_patch_dict()
    port_num = 4343
    MyManager.register("patch_dict", get_patch_dict)
    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    # Set the authkey because it doesn't set properly when we initialize MyManager
    multiprocessing.current_process().authkey = b"password"
    manager.start()
    input("Press any key to kill server".center(50, "-"))
    manager.shutdown()

client.py

from multiprocessing.managers import SyncManager
import multiprocessing
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("patch_dict")

if __name__ == "__main__":
    port_num = 4343

    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    multiprocessing.current_process().authkey = b"password"
    manager.connect()
    patch_dict = manager.patch_dict()

    keys = list(patch_dict.keys())
    for key in keys:
        image_patches = patch_dict.get(key)
        # Do NN stuff (irrelevant)

These scripts work fine for sharing the images when the scripts are run in the same container. But when they are run in separate containers, like this:

# Run the container for the server
docker run -it --name cancer-1 --rm --cpus=10 --ipc=shareable cancer-env
# Run the container for the client
docker run -it --name cancer-2 --rm --cpus=10 --ipc=container:cancer-1 cancer-env

I get the following error:

Traceback (most recent call last):
  File "patch_client.py", line 22, in <module>
    manager.connect()
  File "/usr/lib/python3.5/multiprocessing/managers.py", line 455, in connect
    conn = Client(self._address, authkey=self._authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

Answer

I suggest you try using tmpfs.

It is a Linux feature that lets you create a virtual file system stored entirely in RAM. This allows very fast file access and takes as little as one bash command to set up.

In addition to being very fast and straight-forward, it has many advantages in your case:

  • No need to touch current code - the structure of the dataset stays the same
  • No extra work to create the shared dataset - just cp the dataset into the tmpfs
  • Generic interface - being a filesystem, you can easily integrate the in-RAM dataset with other components in your system that aren't necessarily written in Python. For example, it is easy to use inside your containers: just pass the mount's directory into them.
  • Will fit other environments - if your code will have to run on a different server, tmpfs can adapt and swap pages to the hard drive. If you will have to run this on a server with no free RAM, you could just have all your files on the hard drive with a normal filesystem and not touch your code at all.

Steps to use it:

  1. Create a tmpfs - sudo mount -t tmpfs -o size=600G tmpfs /mnt/mytmpfs
  2. Copy the dataset - cp -r dataset /mnt/mytmpfs
  3. Change all references to the current dataset path to point to the new one (for the containers, see the sketch below)
  4. Enjoy
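
Concretely, with the dataset sitting in /mnt/mytmpfs on the host, each experiment container can see the RAM-backed files through an ordinary read-only bind mount - a sketch reusing the container names from the question, with /data inside the container as a placeholder:

# On the host (steps 1-2 above)
sudo mount -t tmpfs -o size=600G tmpfs /mnt/mytmpfs
cp -r dataset /mnt/mytmpfs

# Each experiment container mounts the same directory read-only
docker run -it --name cancer-1 --rm --cpus=10 -v /mnt/mytmpfs:/data:ro cancer-env
docker run -it --name cancer-2 --rm --cpus=10 -v /mnt/mytmpfs:/data:ro cancer-env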

ramfs might be faster than tmpfs in some cases as it doesn't implement page swapping. To use it just replace tmpfs with ramfs in the instructions above.
