Using set_index() on a Dask Dataframe and writing to parquet causes memory explosion

Problem Description

I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around ~14Gb, so Dask seemed like the right tool for the job. All I'm doing with Dask is:

  1. Reading in the Parquet files
  2. Sorting on one of the columns (called "friend")
  3. Writing as Parquet files in a separate directory

I can't do this without the Dask process (there's just one, I'm using the synchronous scheduler) running out of memory and getting killed. This surprises me, because no one partition is more than ~300 mb uncompressed.
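As a side note (not from the original question), one rough way to sanity-check that per-partition claim is to ask each partition for its in-memory footprint; the input path below is a placeholder:

import os
import dask.dataframe as dd

# Hypothetical check: report each partition's in-memory size in MB.
# "input_dir" is a placeholder path, not taken from the question.
input_path = "input_dir"
full_filenames = [os.path.join(input_path, f) for f in os.listdir(input_path)]

df = dd.read_parquet(full_filenames)
# Same idiom as df.map_partitions(len) for row counts: one value per partition.
sizes_mb = df.map_partitions(
    lambda part: part.memory_usage(deep=True).sum() / 2**20
).compute()
print(sizes_mb.describe())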

I've written a little script to profile Dask with progressively larger portions of my dataset, and I've noticed that Dask's memory consumption scales with the size of the input. Here's the script:

import os
import dask
import dask.dataframe as dd
from dask.diagnostics import ResourceProfiler, ProgressBar

def run(input_path, output_path, input_limit):
    # Single-threaded scheduler, so the profile reflects one worker process.
    dask.config.set(scheduler="synchronous")

    filenames = os.listdir(input_path)
    full_filenames = [os.path.join(input_path, f) for f in filenames]

    # Profile CPU and memory while reading the first `input_limit` files,
    # sorting on the "friend" column via set_index, and writing out Parquet.
    rprof = ResourceProfiler()
    with rprof, ProgressBar():
        df = dd.read_parquet(full_filenames[:input_limit])
        df = df.set_index("friend")
        df.to_parquet(output_path)

    rprof.visualize(file_path=f"profiles/input-limit-{input_limit}.html")
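For completeness, a possible driver for the script above; the paths and limits here are illustrative guesses, not taken from the question:

if __name__ == "__main__":
    # Illustrative values only: run the job on progressively larger slices
    # of the input so the resulting profiles can be compared side by side.
    for limit in (2, 4, 8, 16):
        run("input_dir", f"output_dir/input-limit-{limit}", limit)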

Here are the charts produced by the visualize() call (the ResourceProfiler memory/CPU plots for each input limit; the images are not reproduced here):

The full dataset is ~50 input files, so at this rate of growth I'm not surprised that the job eats up all of the memory on my 32gb machine.

My understanding is that the whole point of Dask is to allow you to operate on larger-than-memory datasets. I get the impression that people are using Dask to process datasets far larger than my ~14gb one. How do they avoid this issue with scaling memory consumption? What am I doing wrong here?

I'm not interested in using a different scheduler or in parallelism at this point. I'd just like to know why Dask is consuming so much more memory than I would have thought necessary.

Recommended Answer

This turns out to have been a performance regression in Dask that was fixed in the 2021.03.0 release.

See this Github issue for more info.
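If you are hitting the same behaviour, it is worth confirming that the installed Dask version is at or past the fix. A minimal check, assuming a pip-managed environment:

import dask

# The regression described above was fixed in the 2021.03.0 release,
# so anything at or beyond that version should not show this blow-up.
print(dask.__version__)

# To upgrade (assuming pip): pip install --upgrade "dask[dataframe]>=2021.03.0"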
