Writing RDD partitions to individual Parquet files, each in its own directory

Views: 23 | Published: 2021/11/14 21:42:51 | Tags: scala, apache-spark, apache-spark-sql, rdd, parquet


Problem description

I am struggling with a step where I want to write each RDD partition to a separate Parquet file in its own directory. For example:

    <root>
        <entity=entity1>
            <year=2015>
                <week=45>
                    data_file.parquet

The advantage of this layout is that Spark SQL can expose the directory keys (entity, year, week) directly as columns, so I do not have to repeat that data in the actual files. It would also be a good way to get to a specific partition without storing separate partitioning metadata somewhere else.

As a preceding step, I have loaded all the data from a large number of gzip files and partitioned it by the key above.

One possible approach would be to get each partition as a separate RDD and then write it out, but I couldn't find a good way of doing that.

Any help will be appreciated. By the way, I am new to this stack.
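
For reference, the layout described above can usually be produced without splitting the RDD by hand: convert it to a DataFrame and let `DataFrameWriter.partitionBy` create the `key=value` directories. What follows is only a minimal sketch, not a definitive solution. It assumes Spark 2.x or later with a `SparkSession`, and the `Record` case class, its `payload` field, the sample rows, and the output path are all hypothetical. Note also that Spark chooses the file names itself (`part-...parquet`), so the file will not be called `data_file.parquet`.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical record type: the three partition keys plus one payload column.
case class Record(entity: String, year: Int, week: Int, payload: String)

object PartitionedParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-parquet-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the RDD that was loaded from the gzip files.
    val rdd = spark.sparkContext.parallelize(Seq(
      Record("entity1", 2015, 45, "a"),
      Record("entity1", 2015, 46, "b"),
      Record("entity2", 2015, 45, "c")
    ))

    val outputRoot = "/tmp/partitioned-parquet-sketch" // hypothetical path

    rdd.toDF()
      // Send all rows that share a key to the same task, so each
      // key directory is written by a single task.
      .repartition($"entity", $"year", $"week")
      .write
      .mode(SaveMode.Overwrite)
      // Creates <root>/entity=.../year=.../week=.../part-*.parquet and
      // leaves the three key columns out of the data files.
      .partitionBy("entity", "year", "week")
      .parquet(outputRoot)

    // Reading the root back restores entity, year and week as columns
    // via partition discovery; filters on them can skip directories.
    val readBack = spark.read.parquet(outputRoot)
    readBack.filter($"entity" === "entity1" && $"week" === 45).show()

    spark.stop()
  }
}
```

The `repartition` call before the write is a design choice rather than a requirement: without it, every task that holds rows for a given key writes its own file into that key's directory, so a directory may end up with several Parquet files instead of one.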
