AWS Glue Crawler Creates Partition and File Tables

Question

I have a pretty basic s3 setup that I would like to query against using Athena. The data is all stored in one bucket, organized into year/month/day/hour folders.

|--data
|   |--2018
|   |   |--01
|   |   |   |--01
|   |   |   |   |--01
|   |   |   |   |   |--file1.json
|   |   |   |   |   |--file2.json
|   |   |   |   |--02
|   |   |   |   |   |--file3.json
|   |   |   |   |   |--file4.json
...

I then setup an AWS Glue Crawler to crawl s3://bucket/data. The schema in all files is identical. I would expect that I would get one database table, with partitions on the year, month, day, etc.

What I get instead are tens of thousands of tables. There is a table for each file, and a table for each parent partition as well. So far as I can tell, separate tables were created for each file/folder, without a single overarching one where I can query across a large date range.

I followed the instructions at https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html to the best of my ability, but cannot figure out how to structure my partitions/scanning such that I don't get this huge, mostly worthless dump of data.

Answer

Glue Crawler leaves a lot to be desired. It promises to solve a lot of situations, but is really limited in what it actually supports. If your data is stored in directories and does not use Hive-style partitioning (e.g. year=2019/month=02/file.json), it will more often than not mess up. It's especially frustrating when the data is produced by other AWS products, like Kinesis Firehose, which it looks like your data could be.

Depending on how much data you have, I might start by just creating an unpartitioned Athena table that points to the root of the structure. It's only once your data grows beyond multiple gigabytes or thousands of files that partitioning becomes important.
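
A minimal sketch of what that could look like, assuming hypothetical names throughout (`events_db.events`, an `id`/`payload` JSON schema, and made-up bucket paths). Athena scans everything under `LOCATION` recursively, so one table at the root covers all the year/month/day/hour folders:

```python
import boto3

athena = boto3.client("athena")

# All names below are hypothetical -- substitute your own database,
# table, bucket, and column definitions to match your JSON schema.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events_db.events (
  id string,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/data/'
"""

# Run the DDL as a regular Athena query; results metadata goes to the
# output location like any other query.
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
```

You could just as well paste the DDL straight into the Athena console; the point is that no crawler is involved at all.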

Another strategy you could employ is to add a Lambda function that gets triggered by an S3 notification whenever a new object lands in your bucket. The function could look at the key and figure out which partition it belongs to and use the Glue API to add that partition to the table. Adding a partition that already exists will return an error from the API, but as long as your function catches it and ignores it you will be fine.
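
A sketch of such a Lambda handler, again with hypothetical names, assuming the table above has been given year/month/day/hour partition keys and the function is wired to S3 `ObjectCreated` notifications:

```python
import boto3
from urllib.parse import unquote_plus

glue = boto3.client("glue")

# Hypothetical names -- replace with your own database and table.
DATABASE = "events_db"
TABLE = "events"

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # e.g. data/2018/01/01/01/file1.json
        key = unquote_plus(record["s3"]["object"]["key"])
        # Derive the partition values from the key's folder structure.
        year, month, day, hour = key.split("/")[1:5]
        location = f"s3://{bucket}/data/{year}/{month}/{day}/{hour}/"
        try:
            glue.create_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionInput={
                    "Values": [year, month, day, hour],
                    "StorageDescriptor": {
                        "Location": location,
                        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                        "OutputFormat": "org.apache.hadoop.hive.ql.io."
                                        "HiveIgnoreKeyTextOutputFormat",
                        "SerdeInfo": {
                            "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
                        },
                    },
                },
            )
        except glue.exceptions.AlreadyExistsException:
            # Partition already registered; safe to ignore, as noted above.
            pass
```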
