如何解决此HIVE_PARTITION_SCHEMA_MISMATCH? [英] How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?

查看:1300
本文介绍了如何解决此HIVE_PARTITION_SCHEMA_MISMATCH?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在S3上的CSV文件中对数据进行了分区:

I have partitioned data in CSV files on S3:


  • s3://bucket/dataset/p=1/*.csv (分区#1)

  • ...

  • s3://bucket/dataset/p=100/*.csv(分区#100)

我在s3:// bucket / dataset /上运行了一个分类器,由于它检测到150列(c1 ,...,c150)并分配各种数据类型。

I run a classifier over s3://bucket/dataset/ and the result looks very much promising as it detects 150 columns (c1,...,c150) and assigns various data types.

在雅典娜中加载结果表并进行查询(从数据集限制10中选择* ),但会产生错误消息:

Loading the resulting table in Athena and querying (select * from dataset limit 10) it though will yield the error message:


HIVE_PARTITION_SCHEMA_MISMATCH:表
之间不匹配和分区架构。类型不兼容,不能强制
。表'tests.dataset'中的列'c100'被声明为
类型'string',但是分区'AANtbd7L1ajIwMTkwOQ'将该列
'c100'声明为类型'boolean'。

HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'c100' in table 'tests.dataset' is declared as type 'string', but partition 'AANtbd7L1ajIwMTkwOQ' declared column 'c100' as type 'boolean'.

首先我不知道如何使用'AANtbd7L1ajIwMTkwOQ'...但是我可以从Glue分区列表中得知某些分区具有c100分类为字符串,有些分类为布尔值。虽然表模式将其列出为字符串。

First of all I have no idea how to make use of 'AANtbd7L1ajIwMTkwOQ' ... but I can tell from the list of partitions in Glue that some partitions have c100 classified as string and some as boolean. While the table schema lists it as string.

这还意味着,如果我将查询限制为将c100分类为与表模式一致的字符串的分区,则该查询将起作用。如果我使用将c100归类为布尔值的分区,则查询失败,并显示上述错误消息。

That also means if I restrict a query to a partition which classifies c100 as string agreeing with the table schema then the query will work. If I use a partition classifying c100 as boolean the query fails with above error message.

现在,通过查看某些CSV列,c100似乎包含三个不同的值:

Now from having a look at some of the CSVs column c100 seems to contain three different values:


  • true

  • false

  • [空] (例如... ,, ...)

可能某行包含错字(也许),因此某些分区归类为字符串-但这只是一种理论,由于文件的数量和大小而难以验证。

Possibly some row contains a typo (maybe) and hence some partitions classify as string - but that is just a theory and a difficult to verify due to the number and size of the files.

我还尝试了 MSCK REPAIR TABLE数据集无济于事。

对此有快速解决方案吗?也许强迫所有分区使用字符串?如果查看分区列表,则会有一个停用的编辑模式按钮。

Is there a quick solution to this? Maybe forcing all partition to use string? If I look at the list of partitions there is a deactivated "edit schema" button.

或者我是否必须编写Glue作业检查并丢弃或修复每一行?

Or do I have to write a Glue job checking and discarding or repairing every row?

推荐答案

如果使用爬虫,则应选择以下选项:

If you are using crawler, you should select following option:

Update all new and existing partitions with metadata from the table



您也可以在创建表格时执行此操作。检查 https:// docs。 aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-schema-changes-prevent

解决问题。如果没有,请在 https://github.com/awsdocs/amazon-athena-user-guide/blob/master/doc_source/glue-best-practices.md#schema-syncing

This should solve issue. If it doesn't then check other options at https://github.com/awsdocs/amazon-athena-user-guide/blob/master/doc_source/glue-best-practices.md#schema-syncing

要了解雅典娜的问题,请检查 https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html

For understanding issue in athena, check https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html

这篇关于如何解决此HIVE_PARTITION_SCHEMA_MISMATCH?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆