Force dask to_parquet to write single file
Question
When using dask.to_parquet(df, filename), a subfolder named filename is created and several files are written into that folder, whereas pandas.to_parquet(df, filename) writes exactly one file.
Can I use dask's to_parquet (without using compute() to create a pandas df) to write just a single file?
Answer
Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor, probably, by any other parallel processing library).
You could in theory perform the operation yourself, with a non-trivial amount of work: you would need to iterate through the partitions of your dataframe, write each one to the target file (which you keep open), and accumulate the output row-groups into the file's final metadata footer. I would know how to go about this with fastparquet, but that library is not being developed much any more.