How to access data in subdirectories for partitioned Athena table

Views: 61 Published: 2021/4/13 18:35:10 Tags: aws-glue aws-glue-data-catalog

Problem description

I have an Athena table with a partition for each day, where the actual files are in "sub-directories" by hour, as follows:

s3://my-bucket/data/2019/06/27/00/00001.json
s3://my-bucket/data/2019/06/27/00/00002.json
s3://my-bucket/data/2019/06/27/01/00001.json
s3://my-bucket/data/2019/06/27/01/00002.json

Athena is able to query this table without issue and find my data, but when using AWS Glue, it does not appear to be able to find this data.

ALTER TABLE mytable ADD 
PARTITION (year=2019, month=06, day=27) LOCATION 's3://my-bucket/data/2019/06/27/01';

select day, count(*)
from mytable
group by day;

day    count
27     145431

I've already tried changing the location of the partition to end with a trailing slash (s3://my-bucket/data/2019/06/27/01/), but this didn't help.

Below are the partition properties in Glue. I was hoping that the storedAsSubDirectories setting would tell it to iterate the sub-directories, but this does not appear to be the case:

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "userid",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "labels",
                    "type": "array<string>",
                    "comment": ""
                }
            ]
        },
        "location": "s3://my-bucket/data/2019/06/27/01/",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "0",
        "SerDeInfo": {
            "name": "JsonSerDe",
            "serializationLib": "org.openx.data.jsonserde.JsonSerDe",
            "parameters": {
                "serialization.format": "1"
            }
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {},
        "SkewedInfo": {
            "skewedColNames": [],
            "skewedColValues": [],
            "skewedColValueLocationMaps": {}
        },
        "storedAsSubDirectories": "true"
    },
    "parameters": {}
}

When Glue runs against this same partition/table, it finds 0 rows.

However, if all the data files appear in the root "directory" of the partition (i.e. s3://my-bucket/data/2019/06/27/00001.json), then both Athena and Glue can find the data.
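The distinction drawn above, between files sitting directly under the partition location and files nested one level deeper, can be made concrete with a small sketch. It is plain Python over a hard-coded list of the object URIs from the listing in the question; the helper name `split_by_depth` is hypothetical, and nothing here claims to reproduce how Glue or Athena actually list S3 objects.

```python
def split_by_depth(partition_location, object_uris):
    """Split URIs into those directly under the partition location
    and those nested in deeper "sub-directories"."""
    # Normalise so the prefix always ends with exactly one slash.
    prefix = partition_location.rstrip("/") + "/"
    direct, nested = [], []
    for uri in object_uris:
        if not uri.startswith(prefix):
            continue  # not part of this partition at all
        remainder = uri[len(prefix):]
        # A remaining slash means at least one more "directory" level.
        (nested if "/" in remainder else direct).append(uri)
    return direct, nested


uris = [
    "s3://my-bucket/data/2019/06/27/00/00001.json",
    "s3://my-bucket/data/2019/06/27/00/00002.json",
    "s3://my-bucket/data/2019/06/27/01/00001.json",
    "s3://my-bucket/data/2019/06/27/01/00002.json",
]

direct, nested = split_by_depth("s3://my-bucket/data/2019/06/27", uris)
print(len(direct), len(nested))  # prints: 0 4
```

With the day-level location, every file counts as nested. If a reader only considered direct children (an assumption, not something the question confirms), it would find no files for that partition.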

Is there some reason why Glue is unable to find the data files? I'd prefer not to create a partition for each hour, since that would mean roughly 8,760 partitions per year (24 × 365), and Athena has a limit of 20,000 partitions per table.
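The partition arithmetic behind this concern is easy to check. The sketch below takes the 20,000-partition limit quoted in the question as an input; it is the asker's figure, not one verified here, and the variable names are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365          # ignoring leap years
PARTITION_LIMIT = 20_000     # per-table limit as quoted in the question

hourly_per_year = HOURS_PER_DAY * DAYS_PER_YEAR
daily_per_year = DAYS_PER_YEAR

# Whole years of data that fit under the limit for each scheme.
years_hourly = PARTITION_LIMIT // hourly_per_year
years_daily = PARTITION_LIMIT // daily_per_year

print(hourly_per_year, years_hourly, years_daily)  # prints: 8760 2 54
```

Under these assumptions, hourly partitioning reaches the quoted limit after a little over two years of data, whereas daily partitioning lasts for decades.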
