Error using spark 'save' does not support bucketing right now


Problem Description

I have a DataFrame which I am trying to partitionBy a column, sort it by that column, and save in Parquet format using the following command:

df.write().format("parquet")
  .partitionBy("dynamic_col")
  .sortBy("dynamic_col")
  .save("test.parquet");

I get the following error:

reason: User class threw exception: org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;

Is save(...) not allowed? Is only saveAsTable(...) allowed, which saves the data to Hive?

Any suggestions are helpful.

Recommended Answer

The problem is that sortBy is currently (Spark 2.3.1) supported only together with bucketing, bucketing needs to be used in combination with saveAsTable, and the bucket sorting column should not be part of the partition columns.

So you have two options:

  1. Don't use sortBy:

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .option("path", output_path)
    .save()

  2. Use sortBy with bucketing and save it through the metastore using saveAsTable:

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .bucketBy(n, bucket_col)
    .sortBy(bucket_col)
    .option("path", output_path)
    .saveAsTable(table_name)
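With option 1, partitionBy alone still splits the output into Hive-style dynamic_col=&lt;value&gt; directories under the output path; only the per-file sort order is lost. As a rough plain-Python illustration of that directory layout (not Spark code; partition_paths is a made-up helper):

```python
from collections import defaultdict

def partition_paths(rows, partition_col, output_path):
    """Group rows into Hive-style <col>=<value> directories,
    mimicking what partitionBy does on disk (illustrative only)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    return {f"{output_path}/{partition_col}={v}": g for v, g in groups.items()}

rows = [
    {"dynamic_col": "a", "x": 1},
    {"dynamic_col": "b", "x": 2},
    {"dynamic_col": "a", "x": 3},
]
layout = partition_paths(rows, "dynamic_col", "test.parquet")
print(sorted(layout))  # ['test.parquet/dynamic_col=a', 'test.parquet/dynamic_col=b']
```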
    

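For intuition on why sortBy is tied to bucketing: bucketing routes each row to one of n fixed bucket files by hashing the bucket column, and sortBy then orders the rows within each bucket file. A minimal plain-Python sketch of that idea (bucket_and_sort is an illustrative stand-in; Spark actually uses a Murmur3 hash, not Python's hash):

```python
def bucket_and_sort(rows, bucket_col, sort_col, n_buckets):
    """Route each row to a bucket by hashing the bucket column,
    then sort rows within each bucket (illustrative only;
    Spark uses Murmur3 hashing, not Python's hash)."""
    buckets = [[] for _ in range(n_buckets)]
    for row in rows:
        buckets[hash(row[bucket_col]) % n_buckets].append(row)
    for b in buckets:
        b.sort(key=lambda r: r[sort_col])
    return buckets

# Six rows; ids land in two buckets, each bucket sorted by ts.
rows = [{"id": i, "ts": 10 - i} for i in range(6)]
buckets = bucket_and_sort(rows, "id", "ts", n_buckets=2)
print([[r["id"] for r in b] for b in buckets])  # [[4, 2, 0], [5, 3, 1]]
```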