How to ignore Amazon Athena struct order

Question

I'm getting an HIVE_PARTITION_SCHEMA_MISMATCH error that I'm not quite sure what to do about. When I look at the 2 different schemas, the only thing that's different is the order of the keys in one of my structs (created by a glue crawler). I really don't care about the order of the data, and I'm receiving the data as a JSON blob, so I cannot guarantee the order of the keys.

struct<device_id:string,user_id:string,payload:array<struct<channel:string,sensor_id:string,type:string,unit:string,value:double,name:string>>,topic:string,channel:string,client_id:string,hardware_id:string,timestamp:bigint,application_id:string>
struct<device_id:string,user_id:string,payload:array<struct<channel:string,name:string,sensor_id:string,type:string,unit:string,value:double>>,topic:string,channel:string,client_id:string,hardware_id:string,timestamp:bigint,application_id:string>


Answer

I suggest you stop using Glue crawlers. It's probably not the response you had hoped for, but crawlers are really bad at their job. They can be useful sometimes as a way to get a schema from a random heap of data that someone else produced and that you don't want to spend time looking at to figure out what its schema is – but once you have a schema, and you know that new data will follow that schema, Glue crawlers are just in the way, and produce unnecessary problems like the one you have encountered.

What to do instead depends on how new data is added to S3.

If you are in control of the code that produces the data, you can add code that adds partitions after the data has been uploaded. The benefit of this solution is that partitions are added immediately after new data has been produced so tables are always up to date. However, it might tightly couple the data producing code with Glue (or Athena if you prefer to add partitions through SQL) in a way that is not desirable.
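For example, if the producer writes each day's data under a date-based prefix, it could register the matching partition right after the upload. Here's a rough sketch using boto3 and Athena DDL – the database iot, table sensor_events, partition key dt, and the S3 locations are placeholders for illustration, not names from your setup:

```python
import boto3

athena = boto3.client("athena")

def add_partition(date_str: str) -> None:
    """Register the partition for one day of data, assuming the table is
    partitioned by a string column `dt` and the data lives under
    s3://example-bucket/data/dt=YYYY-MM-DD/ (both hypothetical)."""
    ddl = (
        "ALTER TABLE iot.sensor_events "
        f"ADD IF NOT EXISTS PARTITION (dt = '{date_str}') "
        f"LOCATION 's3://example-bucket/data/dt={date_str}/'"
    )
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "iot"},
        # Athena needs somewhere to write query results, even for DDL.
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )

# e.g. right after uploading the day's data:
add_partition("2020-04-01")
```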

If it doesn't make sense to add the partitions from the code that produces the data, you can create a Lambda function that does it. You can either set it to run at a fixed time every day (if you know the location of the new data you don't have to wait until it exists, partitions can point to empty locations), or you can trigger it by S3 notifications (if there are multiple files you can either figure out a way to debounce the notifications through SQS or just create the partition over and over again, just swallow the error if the partition already exists).
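A sketch of what such a Lambda could look like when triggered by S3 ObjectCreated notifications – again, the key layout (data/dt=YYYY-MM-DD/...), database, table, and bucket names are assumptions for illustration. Using ADD IF NOT EXISTS makes the call idempotent, which covers the "swallow the error" case without any exception handling:

```python
import urllib.parse
import boto3

athena = boto3.client("athena")

def handler(event, context):
    """Derive the partition from each notified object key and register it."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # e.g. key = "data/dt=2020-04-01/part-0000.json"
        prefix, _, _ = key.rpartition("/")
        date_str = prefix.split("dt=")[-1]
        # ADD IF NOT EXISTS means duplicate notifications for the same
        # prefix are harmless – the statement simply does nothing.
        athena.start_query_execution(
            QueryString=(
                "ALTER TABLE iot.sensor_events "
                f"ADD IF NOT EXISTS PARTITION (dt = '{date_str}') "
                f"LOCATION 's3://{bucket}/{prefix}/'"
            ),
            QueryExecutionContext={"Database": "iot"},
            ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
        )
```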

You may also have heard of MSCK REPAIR TABLE …. It's better than Glue crawlers in some ways, but just as bad in other ways. It will only add new partitions, never change the schema, which is usually what you want, but it's extremely inefficient, and runs slower and slower the more files there are. Kind of like Glue crawlers.
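For reference, if you do run it, it's submitted like any other Athena statement (table and result location below are placeholders):

```python
import boto3

athena = boto3.client("athena")

# Scans the table's S3 location for Hive-style partition directories
# (e.g. dt=2020-04-01/) and adds any that are missing from the metastore.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE sensor_events",
    QueryExecutionContext={"Database": "iot"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```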
