Writing RDD partitions to individual Parquet files, each in its own directory

Views: 187 | Posted: 2016/5/22 15:33:04 | Tags: scala, apache-spark, apache-spark-sql, rdd, parquet


Question

I am struggling with a step where I want to write each RDD partition to a separate Parquet file in its own directory. For example:

    <root>
        <entity=entity1>
            <year=2015>
                <week=45>
                    data_file.parquet
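
To make the naming convention in the tree above concrete, here is a minimal sketch in plain Scala (no Spark needed) that builds the path for one key. The names `PartitionKey`, `PartitionLayout`, `relativeDir`, and `dataFile` are hypothetical and exist only for illustration; the `key=value` segment format and the `data_file.parquet` file name come from the layout above.

```scala
// Hypothetical key type for illustration; the fields mirror the directory levels above.
final case class PartitionKey(entity: String, year: Int, week: Int)

object PartitionLayout {
  // Relative directory for one key, one key=value segment per level.
  def relativeDir(key: PartitionKey): String =
    s"entity=${key.entity}/year=${key.year}/week=${key.week}"

  // Full path of the data file under `root` (a trailing slash on root is tolerated).
  def dataFile(root: String,
               key: PartitionKey,
               fileName: String = "data_file.parquet"): String =
    s"${root.stripSuffix("/")}/${relativeDir(key)}/$fileName"
}
```

For example, `PartitionLayout.relativeDir(PartitionKey("entity1", 2015, 45))` yields `entity=entity1/year=2015/week=45`.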

The advantage of this layout is that I can use entity, year, and week directly as columns in Spark SQL without repeating that data in the actual files. It would also be a good way to get to a specific partition without storing separate partitioning metadata somewhere else.
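
As a rough illustration of that read side, the sketch below relies on Spark's partition discovery for Parquet, which exposes `key=value` directory names as columns. It assumes Spark 2.x or later (`SparkSession`; on 1.x the equivalent entry point would be `SQLContext`). The object and method names (`ReadPartitioned`, `readAll`, `readWeek`) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, for illustration only.
object ReadPartitioned {
  // Read everything under `root`. Partition discovery adds the
  // entity / year / week directory names as columns of the DataFrame.
  def readAll(spark: SparkSession, root: String): DataFrame =
    spark.read.parquet(root)

  // Select a single partition by filtering on the discovered columns.
  def readWeek(spark: SparkSession, root: String,
               entity: String, year: Int, week: Int): DataFrame =
    readAll(spark, root)
      .where(col("entity") === entity && col("year") === year && col("week") === week)
}
```

Because the filter is on partition columns, Spark can usually skip the directories that do not match rather than scanning every file.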

As a preceding step, I have loaded all the data from a large number of gzip files and partitioned it by the key above.

One possible way would be to get each partition as a separate RDD and then write it out, but I couldn't find a good way to do that.
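
For comparison, a different route than splitting the RDD by hand is to convert it to a DataFrame and let `DataFrameWriter.partitionBy` create the directories. The sketch below is not taken from the question; it assumes Spark 2.x or later and that the records fit a case class. `Record`, its `payload` field, and `WritePartitioned` are hypothetical names.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical record type; `payload` stands in for the real data columns.
final case class Record(entity: String, year: Int, week: Int, payload: String)

object WritePartitioned {
  // Write `records` under `root`, one directory level per partition column:
  // <root>/entity=.../year=.../week=.../part-*.parquet
  def write(spark: SparkSession, records: RDD[Record], root: String): Unit = {
    import spark.implicits._ // brings toDF() for RDDs of case classes into scope
    records.toDF()
      .write
      .partitionBy("entity", "year", "week")
      .parquet(root)
  }
}
```

Two caveats: Spark picks the file names itself (`part-...`), so the result will not be a single `data_file.parquet`, and a directory can end up with several files if rows for one key are spread across multiple RDD partitions. The partition columns are stored in the directory names only, not inside the Parquet files.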

Any help would be appreciated. By the way, I am new to this stack.
