AWS Glue Crawler cannot parse large files (classification UNKNOWN)


Problem Description

I've been trying to use an AWS Glue crawler to obtain the columns and other schema details of a particular JSON file.

I parsed the JSON file locally, converted it to UTF-8, uploaded it to an S3 bucket with boto3, and pointed the crawler at that bucket.
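For reference, a minimal sketch of that re-encode-and-upload step with boto3; the source encoding, bucket, and key names here are placeholders I've assumed, not details from the post:

```python
import json

import boto3

# Re-encode the local JSON file as UTF-8 before uploading.
# "latin-1" is an assumed source encoding; adjust to the actual one.
with open("data.json", "r", encoding="latin-1") as src:
    records = json.load(src)

with open("data_utf8.json", "w", encoding="utf-8") as dst:
    json.dump(records, dst, ensure_ascii=False)

# Upload the re-encoded file to the S3 prefix the crawler will scan.
s3 = boto3.client("s3")
s3.upload_file("data_utf8.json", "my-example-bucket", "raw/data_utf8.json")
```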

I created a custom JSON classifier with the JSON path $[*] and created a crawler with default settings.
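The same classifier and crawler setup can also be scripted; a rough boto3 sketch (the classifier/crawler names, IAM role ARN, database, and S3 path are invented for illustration):

```python
import boto3

glue = boto3.client("glue")

# Custom JSON classifier that treats the file as a top-level array of records.
glue.create_classifier(
    JsonClassifier={
        "Name": "json-array-classifier",
        "JsonPath": "$[*]",
    }
)

# Crawler pointed at the S3 prefix, using the classifier above.
glue.create_crawler(
    Name="json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="example_db",
    Classifiers=["json-array-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my-example-bucket/raw/"}]},
)

glue.start_crawler(Name="json-crawler")
```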

When I do this with a relatively small file (< 50 KB), the crawler correctly identifies the columns as well as the internal schema of the nested JSON layers within the main JSON. However, with the file I actually need to process (around 1 GB), the crawler reports "UNKNOWN" as the classification and cannot identify any columns, so I cannot query it.

Any ideas about the issue, or some kind of workaround?

Ultimately I am trying to convert it to Parquet format and do some querying with Athena.
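Once the crawler has catalogued the table, that conversion is typically done with a Glue ETL job; a rough sketch of such a job, where the database name, table name, and output path are assumptions for illustration:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created in the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_data_utf8_json"
)

# Write it back to S3 as Parquet so Athena can query it efficiently.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-example-bucket/parquet/"},
    format="parquet",
)

job.commit()
```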

I've looked at the following post, but that solution did not work. I've already tried rewriting my classifier and crawler. I also assume these are not the core problem, because I used $[*] as my custom classifier and practically identical settings when doing the same thing with the smaller file.

I'm beginning to think the reason is simply the large file size.

Recommended Answer

I might be wrong, but there seems to be some limit on the file size that can be processed. Try splitting your big file into files of about 10 MB (the recommended size). The crawler will process those files in parallel, and when you run it again it will only process changed/new files. Sorry, I couldn't find the related AWS documentation; just try it out and see if it works.
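A minimal sketch of that splitting step, assuming the source is a top-level JSON array (matching the $[*] classifier path) that fits in memory; the file names and the 10 MB target are the only parameters, and all of them are placeholders:

```python
import json
import os

CHUNK_TARGET_BYTES = 10 * 1024 * 1024  # roughly 10 MB per output file

# Assumes the big file is a top-level JSON array of records.
with open("data_utf8.json", "r", encoding="utf-8") as f:
    records = json.load(f)

os.makedirs("chunks", exist_ok=True)
chunk, chunk_size, part = [], 0, 0

for record in records:
    encoded = json.dumps(record, ensure_ascii=False)
    chunk.append(record)
    chunk_size += len(encoded.encode("utf-8"))
    if chunk_size >= CHUNK_TARGET_BYTES:
        with open(f"chunks/part-{part:05d}.json", "w", encoding="utf-8") as out:
            json.dump(chunk, out, ensure_ascii=False)
        chunk, chunk_size, part = [], 0, part + 1

# Write any remaining records.
if chunk:
    with open(f"chunks/part-{part:05d}.json", "w", encoding="utf-8") as out:
        json.dump(chunk, out, ensure_ascii=False)
```

Each chunk can then be uploaded to the same S3 prefix (for example with the boto3 upload shown earlier) before re-running the crawler.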
