dask distributed memory error

Problem description

I got the following error on the scheduler while running Dask on a distributed job:

distributed.core - ERROR -
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/distributed/core.py", line 269, in write
    frames = protocol.dumps(msg)
  File "/usr/local/lib/python3.4/dist-packages/distributed/protocol.py", line 81, in dumps
    frames = dumps_msgpack(small)
  File "/usr/local/lib/python3.4/dist-packages/distributed/protocol.py", line 153, in dumps_msgpack
    payload = msgpack.dumps(msg, use_bin_type=True)
  File "/usr/local/lib/python3.4/dist-packages/msgpack/__init__.py", line 47, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 231, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:231)
  File "msgpack/_packer.pyx", line 239, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:239)
MemoryError

Is this running out of memory on the scheduler or on one of the workers? Or both?

Answer

The most common cause of this error is trying to collect too much data, such as occurs in the following example using dask.dataframe:

import dask.dataframe as dd

df = dd.read_csv('s3://bucket/lots-of-data-*.csv')
df.compute()  # pulls the entire result back through the scheduler to the client

This loads all of the data into RAM across the cluster (which is fine), and then tries to bring the entire result back to the local machine by way of the scheduler (which probably can't handle your hundreds of GB of data all in one place). Worker-to-client communications pass through the scheduler, so it is the first single machine to receive all of the data and the first machine likely to fail.

If this is the case, then you probably want to use the Executor.persist method instead, to trigger the computation but leave the data on the cluster:

# e is an existing distributed Executor (renamed Client in later versions)
df = dd.read_csv('s3://bucket/lots-of-data-*.csv')
df = e.persist(df)  # computes on the workers and keeps the results there
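
Once persisted, the dataframe behaves like the original, but its partitions are held in the workers' memory, so subsequent computations on it start from RAM rather than re-reading the CSV files from S3.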

Generally we only use df.compute() for small results that we want to view in our local session.
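
For example, a small reduction collapses the data on the workers before it travels back through the scheduler, so it is safe to bring to the client. A minimal sketch, where the numeric column name 'value' is a hypothetical stand-in for one of your CSV columns:

# Only a single float crosses the wire; the heavy lifting stays on the workers.
mean_value = df['value'].mean().compute()
print(mean_value)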
