AWS Glue Crawler classifies JSON file as UNKNOWN


Problem description

I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it) and the result is under 1 MB, the crawler classifies it without issue.
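For reference, the minification step described above can be sketched in Python as follows. This is only an illustration: the helper names `minify_json` and `size_in_bytes` are hypothetical, and the 1 MB figure is the threshold observed in this question, not a documented Glue limit.

```python
import json


def minify_json(text: str) -> str:
    """Re-serialize JSON without the whitespace that pretty-printing adds."""
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


def size_in_bytes(text: str) -> int:
    """Size of the text once encoded as UTF-8, as it would be stored in a file."""
    return len(text.encode("utf-8"))


# Threshold observed in the question (an assumption, not a documented limit).
ONE_MB = 1024 * 1024

pretty = json.dumps([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], indent=4)
compact = minify_json(pretty)
print(size_in_bytes(pretty), size_in_bytes(compact), size_in_bytes(compact) < ONE_MB)
```

Minifying only shrinks the file; it does not change the JSON structure, so it helps only when the compact form happens to fall under the observed threshold.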

I'm having trouble coming up with a workaround. I tried converting the JSON to BSON and gzipping the JSON file, but it is still classified as UNKNOWN.

Has anyone else run into this issue? Is there a better way to do this?

Recommended answer

I have two JSON files, 42 MB and 16 MB, stored on S3 under a partitioned path:

  • s3://bucket/stg/year/month/_0.json
  • s3://bucket/stg/year/month/_1.json

I had the same problem as you: the crawler classified the files as UNKNOWN.

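I was able to solve it as follows:

  • Create a custom classifier with the jsonPath set to "$[*]", then create a new crawler that uses this classifier.
  • Run the new crawler against the data on S3, and the proper schema will be created.
  • Do NOT update your current crawler with the classifier, because the change will not be applied. I don't know why; it may be related to the classifier versioning that AWS mentions in its documentation. Create a new crawler to make it work.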
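The steps above could be scripted with boto3 roughly as shown below. This is a minimal sketch, not the answerer's actual setup: the helper names, the classifier and crawler names, the IAM role ARN, the database name, and the S3 prefix are all hypothetical placeholders. The `create_classifier`, `create_crawler`, and `start_crawler` calls are the boto3 Glue client operations.

```python
# JSON path from the answer: treat each element of the top-level array as a record.
JSON_PATH = "$[*]"


def classifier_request(name):
    """Build the arguments for glue.create_classifier (custom JSON classifier)."""
    return {"JsonClassifier": {"Name": name, "JsonPath": JSON_PATH}}


def crawler_request(name, role_arn, database, s3_path, classifier_name):
    """Build the arguments for glue.create_crawler, attaching the classifier."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": [classifier_name],
    }


def create_and_run(glue, classifier_name, crawler_name, role_arn, database, s3_path):
    """Create the classifier, create a NEW crawler that uses it, then start it."""
    glue.create_classifier(**classifier_request(classifier_name))
    glue.create_crawler(
        **crawler_request(crawler_name, role_arn, database, s3_path, classifier_name)
    )
    glue.start_crawler(Name=crawler_name)


# Example usage (all values are placeholders; requires boto3 and AWS credentials):
#   import boto3
#   create_and_run(
#       boto3.client("glue"),
#       classifier_name="json-array-classifier",
#       crawler_name="stg-json-crawler-v2",
#       role_arn="arn:aws:iam::123456789012:role/MyGlueRole",
#       database="staging",
#       s3_path="s3://bucket/stg/",
#   )
```

Note that the sketch deliberately creates a crawler under a new name instead of calling `update_crawler` on the existing one, which mirrors the advice in the last bullet.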
