Tables missing on filesystem in AWS Athena

Problem description

I've created a table with auto partitioning with this code on Athena.

CREATE EXTERNAL TABLE IF NOT EXISTS matchdata.stattable (
  `matchResult` string,
  ...
) PARTITIONED BY (
  year int ,
  month int,
  day int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://******/data/year=2019/month=8/day=2/'
TBLPROPERTIES ('has_encrypted_data'='false');

I then ran MSCK REPAIR TABLE stattable, but got "Tables missing on filesystem", and queries return zero records. Running it against matchdata.stattable gives the same result.

Queries against another table, one without partitioning, work fine. But as the service keeps running and the dataset grows, I need partitioning.

The example data path is data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz. How can I settle this issue?

Recommended answer

As you've discovered (but with some more context for the people having the same issue) MSCK REPAIR TABLE … only understands Hive style partitioning, e.g. /data/year=2019/month=08/day=10/file.json. What the command really does is scan through the prefix on S3 corresponding to the table's LOCATION directive and look for path components that look like that.
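
To make the difference between the two path styles concrete, here is a minimal Python sketch that pulls name=value components out of an S3 key. It only illustrates what "Hive style" means; it is not how Athena implements MSCK REPAIR TABLE, and the function name hive_partitions is made up for this example.

```python
import re

# A Hive-style path component looks like "name=value", e.g. "year=2019".
_HIVE_COMPONENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.+)$")


def hive_partitions(key: str) -> dict:
    """Return the name=value components found in the directory part of an S3 key.

    Hypothetical helper for illustration only.
    """
    found = {}
    # Skip the last component: it is the object (file) name, not a partition.
    for component in key.strip("/").split("/")[:-1]:
        match = _HIVE_COMPONENT.match(component)
        if match:
            found[match.group(1)] = match.group(2)
    return found


# Hive-style layout: partition keys can be read straight from the path.
print(hive_partitions("data/year=2019/month=08/day=10/file.json"))
# {'year': '2019', 'month': '08', 'day': '10'}

# Plain date layout from the question: nothing that looks like name=value.
print(hive_partitions("data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz"))
# {}
```

The second call returns an empty dict, which matches the symptom in the question: with a data/2019/8/2/ layout there are no name=value components to discover.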

This is just a limitation of MSCK REPAIR TABLE …; you can manually add partitions with other path styles like this:

ALTER TABLE the_table ADD PARTITION (year = '2019', month = '08', day = '10') LOCATION 's3://some-bucket/data/2019/08/10/'

See also https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
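
If the keys follow a plain <prefix>/<year>/<month>/<day>/<file> layout like the one in the question, the statement can be generated from the key itself. The sketch below is one way to do that in Python; the function name add_partition_sql and the bucket name some-bucket are placeholders, and the partition values are quoted exactly as in the statement above.

```python
def add_partition_sql(table: str, bucket: str, key: str) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one S3 key.

    Assumes the key looks like <prefix>/<year>/<month>/<day>/<file>.
    Hypothetical helper; table and bucket names are supplied by the caller.
    """
    parts = key.strip("/").split("/")
    if len(parts) < 4:
        raise ValueError(f"expected .../<year>/<month>/<day>/<file>, got {key!r}")

    # The three components before the file name are year, month and day.
    year, month, day = parts[-4:-1]
    if not (year.isdigit() and month.isdigit() and day.isdigit()):
        raise ValueError(f"non-numeric date components in {key!r}")

    # The partition location is the "directory" that holds the file.
    prefix = "/".join(parts[:-1])
    location = f"s3://{bucket}/{prefix}/"

    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year = '{year}', month = '{month}', day = '{day}') "
        f"LOCATION '{location}'"
    )


print(add_partition_sql("stattable", "some-bucket",
                        "data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz"))
```

IF NOT EXISTS is included so that running the statement again for a partition that is already registered does not fail.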

I would go so far as to say that you should avoid using MSCK REPAIR TABLE … altogether. It's slow, and only gets slower the more partitions you have. It's much more efficient to run ALTER TABLE … ADD PARTITION … when you add new data on S3, because you know what you just added and where it is, so telling Athena to scan through your whole prefix is unnecessary. Even faster is using the Glue API directly, but that's more code, unfortunately.
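
For the Glue route, most of the extra code is building the PartitionInput structure. A rough sketch, assuming you have already fetched the table's StorageDescriptor (for example with Glue's GetTable call); the helper name build_partition_input, the bucket name, and the sample descriptor below are illustrative, not taken from the question.

```python
import copy


def build_partition_input(table_storage_descriptor: dict, values: list, location: str) -> dict:
    """Build a PartitionInput dict for Glue's CreatePartition call.

    Hypothetical helper: reuses the table's formats and SerDe and only
    swaps in the partition's own S3 location.
    """
    descriptor = copy.deepcopy(table_storage_descriptor)
    descriptor["Location"] = location
    # Glue expects partition values as strings, in partition-key order.
    return {"Values": [str(v) for v in values], "StorageDescriptor": descriptor}


# Made-up descriptor standing in for what GetTable would return.
sample_descriptor = {
    "Location": "s3://some-bucket/data/",
    "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
}

partition_input = build_partition_input(
    sample_descriptor, [2019, 8, 2], "s3://some-bucket/data/2019/8/2/"
)
print(partition_input["Values"], partition_input["StorageDescriptor"]["Location"])

# With boto3 installed and credentials configured, the call would look roughly like:
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="matchdata", Name="stattable")["Table"]
#   glue.create_partition(
#       DatabaseName="matchdata",
#       TableName="stattable",
#       PartitionInput=build_partition_input(
#           table["StorageDescriptor"], [2019, 8, 2], "s3://some-bucket/data/2019/8/2/"
#       ),
#   )
```

The descriptor is deep-copied so the table's own StorageDescriptor is left untouched when the partition location is set.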
