Force dask to_parquet to write a single file


Problem Description

When using dask.to_parquet(df, filename), a subfolder named filename is created and several files are written into it, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without calling compute() to create a pandas df) to write just a single file?
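For concreteness, a minimal sketch of the behaviour described (the data, paths, and exact part-file names are made up for illustration):

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(10)})

# pandas writes exactly one file at this path
pdf.to_parquet("out_pandas.parquet")

# dask creates a *directory* at this path, containing one file
# per partition (e.g. part.0.parquet, part.1.parquet, ...)
ddf = dd.from_pandas(pdf, npartitions=2)
ddf.to_parquet("out_dask.parquet")
```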

Recommended Answer

Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor, probably, by any other parallel processing library).
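If the goal is simply one data file rather than one path, a common compromise (not part of this answer, and assuming the whole frame fits in a single worker's memory at write time) is to collapse the frame to one partition before writing; dask still creates a folder, but it holds only a single part file:

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=4)

# One partition -> one data file inside the output folder
# (possibly alongside engine-dependent metadata files).
ddf.repartition(npartitions=1).to_parquet("out_single")
```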

You could in theory perform the operation with a non-trivial amount of work on your part: you would need to iterate through the partitions of your dataframe, write to the target file (which you keep open), and accumulate the output row groups into the final metadata footer of the file. I would know how to go about this with fastparquet, but that library is no longer under much development.
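For reference, a rough sketch of that idea, swapping in pyarrow's ParquetWriter for fastparquet (ParquetWriter keeps one file open and writes the footer covering all row groups on close); the file name is a placeholder, and the loop computes partitions one at a time, so the write is sequential rather than parallel:

```python
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

writer = None
for delayed_part in ddf.to_delayed():
    # Materialise one partition at a time, giving up parallelism
    # in exchange for a single output file.
    pdf = delayed_part.compute()
    table = pa.Table.from_pandas(pdf, preserve_index=False)
    if writer is None:
        # The writer holds the file open; the metadata footer is
        # written only when close() is called.
        writer = pq.ParquetWriter("single_file.parquet", table.schema)
    writer.write_table(table)  # each partition becomes its own row group(s)

if writer is not None:
    writer.close()
```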
