Writing Dask partitions into a single file
Question
I'm new to Dask. I have a 1 GB CSV file. When I read it into a Dask dataframe, it creates around 50 partitions, and when I write the data back out after making my changes, Dask creates as many files as there are partitions.
Is there a way to write all partitions to a single CSV file, and is there a way to access individual partitions?
Thank you.
Recommended answer
Short answer
No, by default dask.dataframe.to_csv writes one CSV file per partition. However, there are ways around this.
Perhaps just concatenate the files after dask.dataframe writes them? This is likely to be near-optimal in terms of performance.
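from glob import glob

df.to_csv('/path/to/myfiles.*.csv')  # writes one file per partition

# Sort numerically by partition index; glob() returns files in arbitrary order.
filenames = sorted(glob('/path/to/myfiles.*.csv'), key=lambda fn: int(fn.split('.')[-2]))
with open('outfile.csv', 'w') as out:
    for i, fn in enumerate(filenames):
        with open(fn) as f:
            header = f.readline()  # each partition file starts with its own header row
            if i == 0:
                out.write(header)  # keep only the first header
            out.write(f.read())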
Or use dask.delayed
You can also do this yourself by using dask.delayed alongside dataframes (http://dask.pydata.org/en/latest/delayed-collections.html).
This gives you a list of delayed values that you can use however you like:
list_of_delayed_values = df.to_delayed()
It's then up to you to structure a computation to write these partitions sequentially to a single file. This isn't hard to do, but can cause a bit of backup on the scheduler.
Edit 1 (October 23, 2019)
In Dask 2.6.x, to_csv has a single_file parameter. By default it is False. You can set it to True to get a single output file without calling df.compute().
For example:
df.to_csv('/path/to/myfiles.csv', single_file=True)
Reference: the to_csv documentation