Force dask to_parquet to write single file
Question
When using dask.to_parquet(df, filename), a subfolder named filename is created and several files are written into that folder, whereas pandas.to_parquet(df, filename) writes exactly one file.
Can I use dask's to_parquet (without using compute() to create a pandas df) to write just a single file?
Answer
Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor, probably, by any other parallel processing library).
You could in theory perform the operation yourself, with a non-trivial amount of work: you would need to iterate through the partitions of your dataframe, write each one to the target file (which you keep open), and accumulate the output row-groups into the file's final metadata footer. I would know how to go about this with fastparquet, but that library is not being developed much any more.