Corrupt rows written to __HIVE_DEFAULT_PARTITION__ when attempting to overwrite Hive partition

Problem description

I am seeing some very odd behaviour when attempting to overwrite a partition in a Hive table using Spark 2.3.

First, I set the following option when building my SparkSession:

.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
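For context, a minimal sketch of how that option might sit in a full session builder. The application name is an illustrative assumption, and `enableHiveSupport()` is assumed only because the question writes to Hive tables; neither appears in the original post.

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: the app name is made up for illustration, and Hive support is
// enabled because the tables in the question live in Hive.
val spark = SparkSession
  .builder()
  .appName("partition-overwrite-example")
  // With "dynamic", an overwrite replaces only the partitions present in the
  // incoming data instead of clearing every partition first.
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .enableHiveSupport()
  .getOrCreate()
```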

I then copy some data into a new table, partitioning it by the date_id column.

ds
  .write
  .format("parquet")
  .option("compression", "snappy")
  .option("auto.purge", "true")
  .mode(saveMode)
  .partitionBy("date_id")
  .saveAsTable("tbl_copy")

I can see in HDFS that the relevant date_id directories have been created.

I then create a Dataset containing the data for the partition I wish to overwrite (data for a single date_id) and insert it into Hive as follows:

ds
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("tbl_copy")

As a sanity check, I write the same Dataset to a new table.

ds
  .write
  .format("parquet")
  .option("compression", "snappy")
  .option("auto.purge", "true")
  .mode(SaveMode.Overwrite)
  .saveAsTable("tmp_tbl")

The data in tmp_tbl is exactly as expected.

However, when I look at tbl_copy, I see a new HDFS directory: date_id=__HIVE_DEFAULT_PARTITION__

Querying tbl_copy:

SELECT * from tbl_copy WHERE date_id IS NULL

I see the rows that should have been inserted into partition date_id=20180523. However, the date_id column is null, and an unrelated row_changed column has been populated with the value 20180523.

It appears that the insert into Hive is somehow mangling my data. Writing the same Dataset into a new table causes no issues.

Could anyone shed any light on this?

Recommended answer

So it appears that the partition columns must be the last columns in the Dataset.

I solved the problem by adding the following extension method to Dataset[T].

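import org.apache.spark.sql.{Dataset, Encoder}

// DatasetOps is an assumed name for the implicit wrapper the answer implies.
implicit class DatasetOps[T: Encoder](dataset: Dataset[T]) {
  // Move the partition columns to the end, keeping the other columns in their original order.
  def partitionsTail(partitionColumns: Seq[String]): Dataset[T] = {
    val columns = dataset.schema.collect { case s if !partitionColumns.contains(s.name) => s.name } ++ partitionColumns
    dataset.select(columns.head, columns.tail: _*).as[T]
  }
}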
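The likely reason this works is that `insertInto` resolves columns by position rather than by name, while a table created with `partitionBy` stores its partition columns last. A Dataset whose `date_id` is not in the final position therefore has some other value written into the partition slot. The reordering itself can be sketched without Spark; `ColumnOrder`, `partitionsLast`, and the `id` column below are hypothetical names used only for illustration.

```scala
object ColumnOrder {
  // Hypothetical helper: returns the column names with the partition columns
  // moved to the end, preserving the relative order of the remaining columns.
  def partitionsLast(columns: Seq[String], partitionColumns: Seq[String]): Seq[String] =
    columns.filterNot(c => partitionColumns.contains(c)) ++ partitionColumns
}

// Example with assumed column names: date_id ends up in the final position,
// which is where a table partitioned by date_id expects it.
val ordered = ColumnOrder.partitionsLast(Seq("id", "date_id", "row_changed"), Seq("date_id"))
```

With a real Dataset `ds`, the reordered names could be applied as `ds.select(ordered.head, ordered.tail: _*)` before calling `.write.mode(SaveMode.Overwrite).insertInto("tbl_copy")`.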
