Writing Dask partitions into a single file


Problem description

I am new to Dask. I have a 1GB CSV file; when I read it into a Dask dataframe, it creates around 50 partitions. After making my changes, when I write the data out, it creates as many files as there are partitions.

Is there a way to write all partitions to a single CSV file, and is there a way to access the partitions?

Thank you.

Recommended answer

Short answer

No, dask.dataframe.to_csv only writes CSV files to different files, one file per partition (but see Edit 1 below for newer Dask versions). However, there are ways around this.

Perhaps just concatenate the files after dask.dataframe writes them? This is likely to be near-optimal in terms of performance.

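from glob import glob

df.to_csv('/path/to/myfiles.*.csv')  # writes one file per partition

# glob returns files in arbitrary order, so sort to keep partition order
filenames = sorted(glob('/path/to/myfiles.*.csv'))
with open('outfile.csv', 'w') as out:
    for i, fn in enumerate(filenames):
        with open(fn) as f:
            if i > 0:
                next(f, None)  # skip the header row repeated in every partition file
            out.write(f.read())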



Or use dask.delayed

However, you can do this yourself by using dask.delayed alongside dataframes (see http://dask.pydata.org/en/latest/delayed-collections.html).

This gives you a list of delayed values, one per partition, that you can use however you like:

list_of_delayed_values = df.to_delayed()

It's then up to you to structure a computation to write these partitions sequentially to a single file. This isn't hard to do, but can cause a bit of backup on the scheduler.

Edit 1 (23 October 2019):

In Dask 2.6.x, to_csv has a single_file parameter. It defaults to False. Set it to True to get a single output file without calling df.compute().

For example:

df.to_csv('/path/to/myfiles.csv', single_file = True)

Reference: the to_csv documentation
