Athena puts data in incorrect columns when input data format changes

Question

We have some pipe-delimited .txt reports coming into a folder in S3, on which we run a Glue crawler to determine the schema so that we can query them in Athena.

The format of the report changed recently so there are two new columns in the middle.

Old files:

Columns A  B  C  D  E  F
Data    a1 b1 c1 d1 e1 f1

New files with extra "G" and "H" columns:

Columns A  B  G  H  C  D  E  F
Data    a2 b2 g2 h2 c2 d2 e2 f2

What we get in the table created by the crawler as seen in Athena:

Columns A  B  C  D  E  F  G  H    <- Puts new columns at the end. OK
Data    a1 b1 c1 d1 e1 f1         <- Correct for old data
Data    a2 b2 g2 h2       e2 f2   <- 4 columns incorrect and 2 missing

Is this some sort of bug in the Glue crawler, or is there a way to configure it so that it puts the right data in the right columns (other than running a data cleaning script to transform the input files)?

Recommended answer

I think this is yet another case of Glue overpromising and underdelivering. As long as the data format is delimited text, Glue will do the wrong thing if you add columns in the middle. Adding or removing (but not both) columns at the end works, but not in the middle. Athena does not support different columns for different partitions, so there is no way that Glue could make this work, even though it makes it look like it can.
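To illustrate why columns added in the middle cause trouble, here is a minimal sketch that assumes fields in delimited text are assigned to table columns purely by position. `map_by_position` is a hypothetical helper written for this illustration, not part of Glue or Athena, and it ignores column types entirely.

```python
# Hypothetical helper: models position-based mapping of one delimited line
# onto a fixed list of table column names. Column types are ignored.
def map_by_position(line, table_columns, delimiter="|"):
    fields = line.split(delimiter)
    # Pad with None when the line has fewer fields than the table has columns.
    fields += [None] * (len(table_columns) - len(fields))
    # zip() stops at the shorter input, so extra fields would be dropped.
    return dict(zip(table_columns, fields))

# Column order of the table from the question: G and H appended at the end.
TABLE_COLUMNS = ["A", "B", "C", "D", "E", "F", "G", "H"]

old_row = map_by_position("a1|b1|c1|d1|e1|f1", TABLE_COLUMNS)
new_row = map_by_position("a2|b2|g2|h2|c2|d2|e2|f2", TABLE_COLUMNS)

print(old_row)
print(new_row)
```

In this model the old row lines up correctly and simply has no values for G and H, while the new row puts g2 and h2 under C and D and shifts everything after them. The question's output shows E and F as empty rather than holding c2 and d2; one possible explanation, which is only a guess, is that those values could not be converted to the column types.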

You will either have to rewrite the data, change to add the columns last, or switch to a different data format where the files contain enough metadata for this not to be a problem: JSON, Avro, or Parquet.
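As a rough sketch of the "rewrite the data" option, the snippet below reorders a pipe-delimited report into one fixed column order using Python's standard `csv` module. It assumes each file starts with a header row naming its columns, which the question does not confirm; `reorder` and `CANONICAL` are names made up for this example.

```python
import csv
import io

# Assumed target order: the new columns G and H go last.
CANONICAL = ["A", "B", "C", "D", "E", "F", "G", "H"]

def reorder(text, target_columns=CANONICAL, delimiter="|"):
    """Rewrite delimited text (with a header row) into target_columns order."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    # restval="" fills columns that the input file does not have.
    writer = csv.DictWriter(
        out,
        fieldnames=target_columns,
        delimiter=delimiter,
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Raises ValueError if the input has a column not in target_columns.
        writer.writerow(row)
    return out.getvalue()

old_file = "A|B|C|D|E|F\na1|b1|c1|d1|e1|f1\n"
new_file = "A|B|G|H|C|D|E|F\na2|b2|g2|h2|c2|d2|e2|f2\n"

old_fixed = reorder(old_file)
new_fixed = reorder(new_file)
print(old_fixed)
print(new_fixed)
```

After a rewrite like this, every file has the same column order, with G and H at the end and left empty for old data, so mapping by position gives consistent results. Writing JSON, Avro, or Parquet instead would carry the field names in the files themselves.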

I would suggest you stop using Glue crawlers altogether. They look like a general-purpose tool, but they really solve only a few use cases. See https://stackoverflow.com/a/56439429/1109 for some suggestions on what to do instead.
