Using set_index() on a Dask Dataframe and writing to parquet causes memory explosion

Problem Description

I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around ~14Gb, so Dask seemed like the right tool for the job. All I'm doing with Dask is:

  1. Reading in the Parquet files
  2. Sorting on one of the columns (called "friend")
  3. Writing as Parquet files in a separate directory

I can't do this without the Dask process (there's just one, I'm using the synchronous scheduler) running out of memory and getting killed. This surprises me, because no one partition is more than ~300 mb uncompressed.
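As a side note (not from the original question), one rough way to sanity-check that per-partition claim is to ask each partition for its in-memory footprint; the input path below is a placeholder:

import os
import dask.dataframe as dd

# Hypothetical check: report each partition's in-memory size in MB.
# "input_dir" is a placeholder path, not taken from the question.
input_path = "input_dir"
full_filenames = [os.path.join(input_path, f) for f in os.listdir(input_path)]

df = dd.read_parquet(full_filenames)
# Same idiom as df.map_partitions(len) for row counts: one value per partition.
sizes_mb = df.map_partitions(
    lambda part: part.memory_usage(deep=True).sum() / 2**20
).compute()
print(sizes_mb.describe())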

I've written a little script to profile Dask with progressively larger portions of my dataset, and I've noticed that Dask's memory consumption scales with the size of the input. Here's the script:

import os
import dask
import dask.dataframe as dd
from dask.diagnostics import ResourceProfiler, ProgressBar

def run(input_path, output_path, input_limit):
    # Single-threaded scheduler, so the profile reflects one worker process.
    dask.config.set(scheduler="synchronous")

    filenames = os.listdir(input_path)
    full_filenames = [os.path.join(input_path, f) for f in filenames]

    # Profile CPU and memory while reading the first `input_limit` files,
    # sorting on the "friend" column via set_index, and writing out Parquet.
    rprof = ResourceProfiler()
    with rprof, ProgressBar():
        df = dd.read_parquet(full_filenames[:input_limit])
        df = df.set_index("friend")
        df.to_parquet(output_path)

    rprof.visualize(file_path=f"profiles/input-limit-{input_limit}.html")
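For completeness, a possible driver for the script above; the paths and limits here are illustrative guesses, not taken from the question:

if __name__ == "__main__":
    # Illustrative values only: run the job on progressively larger slices
    # of the input so the resulting profiles can be compared side by side.
    for limit in (2, 4, 8, 16):
        run("input_dir", f"output_dir/input-limit-{limit}", limit)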

Here are the charts produced by the visualize() call (the ResourceProfiler memory/CPU plots for each input limit; the images are not reproduced here):

The full dataset is ~50 input files, so at this rate of growth I'm not surprised that the job eats up all of the memory on my 32gb machine.

My understanding is that the whole point of Dask is to allow you to operate on larger-than-memory datasets. I get the impression that people are using Dask to process datasets far larger than my ~14gb one. How do they avoid this issue with scaling memory consumption? What am I doing wrong here?

I'm not interested in using a different scheduler or in parallelism at this point. I'd just like to know why Dask is consuming so much more memory than I would have thought necessary.

Recommended Answer

This turns out to have been a performance regression in Dask that was fixed in the 2021.03.0 release.

See this Github issue for more info.
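If you are hitting the same behaviour, it is worth confirming that the installed Dask version is at or past the fix. A minimal check, assuming a pip-managed environment:

import dask

# The regression described above was fixed in the 2021.03.0 release,
# so anything at or beyond that version should not show this blow-up.
print(dask.__version__)

# To upgrade (assuming pip): pip install --upgrade "dask[dataframe]>=2021.03.0"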
