AWS Glue Crawler adding tables for every partition?


Question

I have several thousand files in an S3 bucket, laid out like this:

├── bucket
│   ├── somedata
│   │   ├── year=2016
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── sometype-2017-11-01.parquet
│   │   │   │   ├── sometype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   │   ├── month=12
│   │   │   │   ├── sometype-2017-12-01.parquet
│   │   │   │   ├── sometype-2017-12-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=2018
│   │   │   ├── month=01
│   │   │   │   ├── sometype-2018-01-01.parquet
│   │   │   │   ├── sometype-2018-01-02.parquet
│   │   │   │   ├── ...
│   ├── moredata
│   │   ├── year=2017
│   │   │   ├── month=11
│   │   │   │   ├── moretype-2017-11-01.parquet
│   │   │   │   ├── moretype-2017-11-02.parquet
│   │   │   │   ├── ...
│   │   ├── year=...

Expected behavior: The AWS Glue Crawler creates one table for each of somedata, moredata, etc. It creates partitions for each table based on the child path names (year=..., month=...).

Actual behavior: The AWS Glue Crawler does the above, but ALSO creates a separate table for every partition of the data, resulting in several hundred extraneous tables (and more extraneous tables each time data is added and the crawler runs again).

I see no setting that would prevent this from happening. Does anyone have advice on the best way to stop these unnecessary tables from being created?

Recommended answer

Check whether there are empty folders inside the data paths. When Spark writes to S3, the _temporary folder is sometimes not deleted, and that can make the Glue crawler create a table for each partition.
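
One way to run that check is sketched below in Python. It is an illustration, not part of the original answer: the helper names `find_suspect_keys` and `list_objects` are made up for this example, and the rule used (any key with a `_temporary` path component, or a zero-byte key ending in `/`) is an assumption about what "empty folders" look like in the bucket. The listing helper relies on boto3's `list_objects_v2` paginator and needs AWS credentials; the filtering function is pure and can be tried on any list of (key, size) pairs.

```python
from typing import Iterable, Iterator, List, Tuple


def find_suspect_keys(objects: Iterable[Tuple[str, int]]) -> List[str]:
    """Return keys that look like Spark leftovers or empty-folder markers.

    `objects` is an iterable of (key, size_in_bytes) pairs.
    """
    suspects = []
    for key, size in objects:
        # Spark/Hadoop staging output lives under a "_temporary" path component.
        has_temporary = "_temporary" in key.split("/")
        # Assumption: an empty "folder" shows up as a zero-byte key ending in "/".
        is_empty_folder_marker = size == 0 and key.endswith("/")
        if has_temporary or is_empty_folder_marker:
            suspects.append(key)
    return suspects


def list_objects(bucket: str, prefix: str = "") -> Iterator[Tuple[str, int]]:
    """Yield (key, size) for every object under the prefix (hypothetical helper)."""
    import boto3  # imported lazily so find_suspect_keys works without AWS access

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]
```

With credentials configured, something like `find_suspect_keys(list_objects("bucket", "somedata/"))` would list candidates to review and delete before re-running the crawler; the bucket and prefix here are taken from the layout in the question.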
