Athena puts data in incorrect columns when input data format changes
Question
We have some pipe-delimited .txt reports coming into a folder in S3, on which we run a Glue crawler to determine the schema so that we can query them in Athena.
The format of the report changed recently, so there are now two new columns in the middle.
Old files:
Columns A B C D E F
Data a1 b1 c1 d1 e1 f1
New files with extra "G" and "H" columns:
Columns A B G H C D E F
Data a2 b2 g2 h2 c2 d2 e2 f2
What we get in the table created by the crawler, as seen in Athena:
Columns A B C D E F G H <- Puts new columns at the end. OK
Data a1 b1 c1 d1 e1 f1 <- Correct for old data
Data a2 b2 g2 h2 e2 f2 <- 4 columns incorrect and 2 missing
Is this some sort of bug in the Glue crawler, or is there a way to configure it so that it puts the right data in the right columns (other than running a data cleaning script to transform the input files)?
Recommended answer
I think this is yet another case of Glue overpromising and underdelivering. As long as the data format is delimited text, Glue will do the wrong thing if you add columns in the middle. Adding or removing columns (but not both) at the end works, but not in the middle. Athena does not support different columns for different partitions, so there is no way that Glue could make this work, yet it makes it look like it can.
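To see why the position of the new columns matters, here is a minimal sketch of a reader that assigns delimited fields to table columns purely by position. This is a simplified model for illustration, not Athena's or Glue's actual implementation; the function name `read_positionally` and the sample rows are assumptions, and the sketch does not try to reproduce the exact output shown in the question.

```python
# Simplified model: delimited text carries no column names per row,
# so the reader can only match fields to columns by position.
TABLE_COLUMNS = ["A", "B", "C", "D", "E", "F", "G", "H"]


def read_positionally(line, columns=TABLE_COLUMNS, delimiter="|"):
    """Map the n-th field of a line to the n-th table column (hypothetical helper)."""
    fields = line.split(delimiter)
    # Rows with fewer fields than columns get None for the trailing columns.
    padded = fields + [None] * (len(columns) - len(fields))
    return dict(zip(columns, padded))


# Old layout (A..F): trailing columns G and H are simply empty.
old_row = read_positionally("a1|b1|c1|d1|e1|f1")

# New columns appended at the end: every value lands in the right column.
appended_row = read_positionally("a2|b2|c2|d2|e2|f2|g2|h2")

# New columns inserted in the middle: everything after B is shifted.
inserted_row = read_positionally("a2|b2|g2|h2|c2|d2|e2|f2")
```

In this model `appended_row["C"]` is `"c2"`, while `inserted_row["C"]` is `"g2"`, which is the kind of misplacement described in the question.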
You will either have to rewrite the data, change the reports so that new columns are added last, or switch to a different data format where the files contain enough metadata for this not to be a problem: JSON, Avro, or Parquet.
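As a rough sketch of the first and last options, the following uses only the Python standard library to rewrite a report so that the new columns come last, or to convert it to JSON Lines, where each record carries its own field names. It assumes each file starts with a header row naming its columns, which the question does not state explicitly; the function names and `TARGET_ORDER` are hypothetical.

```python
import csv
import io
import json

# Assumed target layout: original columns first, new columns last.
TARGET_ORDER = ["A", "B", "C", "D", "E", "F", "G", "H"]


def reorder_columns(text, target_order=TARGET_ORDER, delimiter="|"):
    """Rewrite delimited text (with a header row) into the target column order."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=target_order, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Columns missing from an older file are written as empty fields.
        writer.writerow({name: row.get(name, "") for name in target_order})
    return out.getvalue()


def to_json_lines(text, delimiter="|"):
    """Convert delimited text (with a header row) to one JSON object per line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return "\n".join(json.dumps(row) for row in reader)


new_report = "A|B|G|H|C|D|E|F\na2|b2|g2|h2|c2|d2|e2|f2\n"
old_report = "A|B|C|D|E|F\na1|b1|c1|d1|e1|f1\n"

reordered_new = reorder_columns(new_report)
reordered_old = reorder_columns(old_report)
json_new = to_json_lines(new_report)
```

After reordering, old and new files share one positional layout. With JSON Lines, fields are matched by name rather than by position, so the order in which columns were added no longer matters.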
I would suggest you stop using Glue crawlers altogether. They look like a general-purpose tool, but they really solve only a few use cases. See https://stackoverflow.com/a/56439429/1109 for some suggestions on what to do instead.