Writing RDD partitions to individual parquet files in its own directory
Question
I am struggling with the step where I want to write each RDD partition to a separate Parquet file in its own directory. An example would be:
<root>
  <entity=entity1>
    <year=2015>
      <week=45>
        data_file.parquet
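The layout above is Hive-style partitioning, where each key/value pair becomes a directory level. As a minimal sketch (the base path and field names are assumptions for illustration), the path for one partition can be built like this:

```scala
// Build a Hive-style partition path from the key fields.
// The base path and field names here are illustrative assumptions.
def partitionPath(base: String, entity: String, year: Int, week: Int): String =
  s"$base/entity=$entity/year=$year/week=$week"

val path = partitionPath("hdfs://localhost:9000/parquet_data", "entity1", 2015, 45)
// path: "hdfs://localhost:9000/parquet_data/entity=entity1/year=2015/week=45"
```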
The advantage of this format is that I can use these values directly in SparkSQL as columns, and I will not have to repeat this data in the actual files. This would be a good way to get to a specific partition without storing separate partitioning metadata someplace else.
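With that layout, Spark SQL's partition discovery turns the directory names into columns automatically. A sketch of reading it back, assuming an existing SparkContext named sc and the base path used above:

```scala
import org.apache.spark.sql.SQLContext

// Assumes `sc` is an existing SparkContext and the data lives under the
// base path shown earlier. Partition discovery exposes entity/year/week
// as regular columns even though they are not stored inside the files.
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.parquet("hdfs://localhost:9000/parquet_data")

// Filtering on the partition columns prunes to the matching directories.
df.filter("entity = 'entity1' AND year = 2015 AND week = 45").show()
```

Because the filter touches only partition columns, Spark can skip reading every other directory entirely.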
As a preceding step I have all the data loaded from a large number of gzip files and partitioned based on the above key.
A possible way would be to get each partition as a separate RDD and then write it, though I couldn't find any good way of doing that.
Any help will be appreciated. By the way, I am new to this stack.
Answer
I think it's possible by calling foreachPartition(f: Iterator[T] => Unit) on the RDD you want to save.
In the function you provide to foreachPartition:
- prepare the path hdfs://localhost:9000/parquet_data/year=X/week=Y
- create a ParquetWriter (see for example https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L255-281)
- exhaust the Iterator by inserting each row into the RecordWriter
- clean up
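The steps above can be sketched roughly as follows. This is not a definitive implementation: it assumes an RDD[GenericRecord] that is already partitioned by the key, uses parquet-avro's AvroParquetWriter as the concrete ParquetWriter, and the pathFor helper (deriving the entity=/year=/week= directory from a record) is a hypothetical name introduced here for illustration. It also needs a real cluster and Hadoop filesystem to run:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Assumptions: `rdd` is an RDD[GenericRecord] already partitioned by the
// key, and `pathFor` (hypothetical) maps a record to its
// entity=/year=/week= directory under the base path.
rdd.foreachPartition { rows: Iterator[GenericRecord] =>
  if (rows.hasNext) {
    val first = rows.next()
    // 1. prepare the path for this partition
    val path = new Path(pathFor(first))
    // 2. open a ParquetWriter for that path
    val writer = AvroParquetWriter.builder[GenericRecord](path)
      .withSchema(first.getSchema)
      .build()
    try {
      // 3. exhaust the iterator, inserting each row into the writer
      writer.write(first)
      rows.foreach(writer.write)
    } finally {
      // 4. clean up
      writer.close()
    }
  }
}
```

Note that everything inside the closure runs on the executors, so the writer must be created there (as above) rather than on the driver, and each task writes one file per partition.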