AWS Glue Crawler cannot parse large files (classification UNKNOWN)


Problem Description

I've been trying to use an AWS Glue crawler to obtain the columns and other schema details of a particular JSON file.

I parsed the JSON file locally, converted it to UTF-8, uploaded it to an S3 bucket with boto3, and pointed the crawler at that bucket.
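For reference, a minimal sketch of that re-encode-and-upload step with boto3; the source encoding, bucket, and key names here are placeholders I've assumed, not details from the post:

```python
import json

import boto3

# Re-encode the local JSON file as UTF-8 before uploading.
# "latin-1" is an assumed source encoding; adjust to the actual one.
with open("data.json", "r", encoding="latin-1") as src:
    records = json.load(src)

with open("data_utf8.json", "w", encoding="utf-8") as dst:
    json.dump(records, dst, ensure_ascii=False)

# Upload the re-encoded file to the S3 prefix the crawler will scan.
s3 = boto3.client("s3")
s3.upload_file("data_utf8.json", "my-example-bucket", "raw/data_utf8.json")
```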

I created a custom JSON classifier with the JSON path $[*] and created a crawler with default settings.
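The same classifier and crawler setup can also be scripted; a rough boto3 sketch (the classifier/crawler names, IAM role ARN, database, and S3 path are invented for illustration):

```python
import boto3

glue = boto3.client("glue")

# Custom JSON classifier that treats the file as a top-level array of records.
glue.create_classifier(
    JsonClassifier={
        "Name": "json-array-classifier",
        "JsonPath": "$[*]",
    }
)

# Crawler pointed at the S3 prefix, using the classifier above.
glue.create_crawler(
    Name="json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="example_db",
    Classifiers=["json-array-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my-example-bucket/raw/"}]},
)

glue.start_crawler(Name="json-crawler")
```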

When I do this with a relatively small file (< 50 KB), the crawler correctly identifies the columns as well as the internal schema of the nested JSON layers within the main JSON. However, with the file I actually need to process (around 1 GB), the crawler reports "UNKNOWN" as the classification and cannot identify any columns, so I cannot query it.

Any ideas about the issue, or some kind of workaround?

Ultimately I am trying to convert it to Parquet format and do some querying with Athena.
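Once the crawler has catalogued the table, that conversion is typically done with a Glue ETL job; a rough sketch of such a job, where the database name, table name, and output path are assumptions for illustration:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created in the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_data_utf8_json"
)

# Write it back to S3 as Parquet so Athena can query it efficiently.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-example-bucket/parquet/"},
    format="parquet",
)

job.commit()
```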

I've looked at the following post, but that solution did not work. I've already tried rewriting my classifier and crawler. I also assume these are not the core problem, because I used $[*] as my custom classifier and practically identical settings when doing the same thing with the smaller file.

I'm beginning to think the reason is simply the large file size.

Recommended Answer

I might be wrong, but there seems to be some limit on the file size that can be processed. Try splitting your big file into files of about 10 MB (the recommended size). The crawler will process those files in parallel, and when you run it again it will only process changed/new files. Sorry, I couldn't find the related AWS documentation; just try it out and see if it works.
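A minimal sketch of that splitting step, assuming the source is a top-level JSON array (matching the $[*] classifier path) that fits in memory; the file names and the 10 MB target are the only parameters, and all of them are placeholders:

```python
import json
import os

CHUNK_TARGET_BYTES = 10 * 1024 * 1024  # roughly 10 MB per output file

# Assumes the big file is a top-level JSON array of records.
with open("data_utf8.json", "r", encoding="utf-8") as f:
    records = json.load(f)

os.makedirs("chunks", exist_ok=True)
chunk, chunk_size, part = [], 0, 0

for record in records:
    encoded = json.dumps(record, ensure_ascii=False)
    chunk.append(record)
    chunk_size += len(encoded.encode("utf-8"))
    if chunk_size >= CHUNK_TARGET_BYTES:
        with open(f"chunks/part-{part:05d}.json", "w", encoding="utf-8") as out:
            json.dump(chunk, out, ensure_ascii=False)
        chunk, chunk_size, part = [], 0, part + 1

# Write any remaining records.
if chunk:
    with open(f"chunks/part-{part:05d}.json", "w", encoding="utf-8") as out:
        json.dump(chunk, out, ensure_ascii=False)
```

Each chunk can then be uploaded to the same S3 prefix (for example with the boto3 upload shown earlier) before re-running the crawler.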
