How to reliably write and restore partitioned data

Question

I am looking for a way to write and restore a partitioned dataset. For the purpose of this question I can accept both a partitioned RDD:

val partitioner: org.apache.spark.Partitioner = ???
rdd.partitionBy(partitioner)

and a Dataset[Row] / DataFrame:

df.repartition($"someColumn")

The goal is to avoid a shuffle when the data is restored. For example, writing:

spark.range(n).withColumn("foo", lit(1))
  .repartition(m, $"id")
  .write
  .partitionBy("id")
  .parquet(path)

shouldn't require a shuffle for:

spark.read.parquet(path).repartition(m, $"id")

I thought about writing the partitioned Dataset to Parquet, but I believe that Spark doesn't use this information.
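One way to test that belief is to look for an Exchange node in the physical plan of the round-tripped data. The sketch below is purely illustrative: the path, row count and partition count m are placeholder values, and spark is an existing SparkSession as in the snippets above.

import org.apache.spark.sql.functions.lit
import spark.implicits._  // for the $"..." column syntax

val path = "/tmp/partitioned-parquet"  // placeholder location
val m = 8                              // placeholder partition count

spark.range(100).withColumn("foo", lit(1))
  .repartition(m, $"id")
  .write
  .partitionBy("id")
  .parquet(path)

// If Spark reused the on-disk layout, this plan would contain no Exchange node.
spark.read.parquet(path).repartition(m, $"id").explain()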

I can only work with disk storage, not a database or a data grid.

Answer

This can probably be achieved with bucketBy in the DataFrame/Dataset API, but there is a catch - saving directly to Parquet won't work; only saveAsTable does.

Dataset<Row> parquet =...;
parquet.write()
  .bucketBy(1000, "col1", "col2")
  .partitionBy("col3")
  .saveAsTable("tableName");

sparkSession.read().table("tableName");

Another approach, for Spark core, is to use a custom RDD - see e.g. https://github.com/apache/spark/pull/4449 - i.e. after reading the RDD back from HDFS you essentially set the partitioner again, but it's a bit hacky and not supported natively (so it needs to be adjusted for every Spark version).
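To make that idea concrete, here is a minimal, untested sketch of the pattern (not the code from the linked PR): wrap an RDD whose on-disk layout already matches a known partitioner and simply re-declare that partitioner, so downstream co-partitioned operations skip the shuffle. The class name and all details are hypothetical, and correctness rests entirely on the caller.

import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Wraps an already-partitioned RDD and declares its partitioner without shuffling.
// If the actual layout does not match `part`, joins and lookups will silently
// produce wrong results.
class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner)
  extends RDD[(K, V)](prev) {

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}

// Usage (illustrative): `readBack` must have exactly one partition per partitioner
// bucket, in the right order, e.g. one file per key range read back deterministically.
// val restored = new AssumePartitionedRDD(readBack, partitioner)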
