AWS Glue Crawler Classifies JSON File as UNKNOWN
Question
I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it) and the result is under 1 MB, the crawler also classifies it without issue.
I'm having trouble coming up with a workaround. I tried converting the JSON to BSON and gzipping the JSON file, but it is still classified as UNKNOWN.
Has anyone else run into this issue? Is there a better way to do this?
Recommended answer
I have two JSON files, 42 MB and 16 MB, partitioned on S3 by path:
s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json
I had the same problem as you: the crawler classified the files as UNKNOWN.
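I was able to solve it:
- Create a custom classifier with the jsonPath set to "$[*]", then create a new crawler that uses that classifier.
- Run the new crawler against the data on S3; it will create the correct schema.
- Do NOT update your current crawler with the classifier, because the change will not be applied. I don't know why; it may be related to the classifier versioning that AWS mentions in its documentation. Create a new crawler to make it work.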
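
The steps above can also be scripted instead of done in the console. The following is only a minimal sketch using boto3's Glue client calls `create_classifier`, `create_crawler`, and `start_crawler`. The function name and every argument value (classifier name, crawler name, IAM role, database, S3 path) are hypothetical placeholders that do not come from the original answer, so adjust them to your setup.

```python
def create_crawler_with_classifier(glue, classifier_name, crawler_name,
                                   role, database, s3_path,
                                   json_path="$[*]"):
    """Create a custom JSON classifier, then a NEW crawler that uses it.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    All names passed in are placeholders chosen by the caller.
    """
    # Step 1: custom JSON classifier with the jsonPath from the answer.
    glue.create_classifier(
        JsonClassifier={"Name": classifier_name, "JsonPath": json_path}
    )

    # Step 2: a brand-new crawler that references the classifier
    # (the answer reports that updating an existing crawler did not work).
    glue.create_crawler(
        Name=crawler_name,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
        Classifiers=[classifier_name],
    )

    # Step 3: run the new crawler against the data on S3.
    glue.start_crawler(Name=crawler_name)
```

To use it, you would pass a real client, for example `create_crawler_with_classifier(boto3.client("glue"), "json-array-classifier", "stg-json-crawler", "MyGlueRole", "staging_db", "s3://bucket/stg/")`, where all of those values are illustrative only.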