Specify a SerDe serialization lib with AWS Glue Crawler


Question

Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields with commas in them).

I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.

I've tried making my own CSV classifier, but that doesn't help.

How do I get the crawler to specify a particular serialization lib for the tables produced or updated?

Answer

You can't specify the SerDe in the Glue Crawler at this time, but here is a workaround...

  1. Create a Glue Crawler with the following configuration.

Enable 'Add new columns only' - This adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.

Enable 'Update all new and existing partitions with metadata from the table' - This option inherits metadata properties such as classification, input format, output format, SerDe information, and schema from the parent table. Any changes to these properties in the table are propagated to its partitions.
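These two console options can also be set programmatically through the crawler's Configuration JSON ('Add new columns only' corresponds to `MergeNewColumns` for columns, the partition option to `InheritFromTable`). A minimal boto3 sketch, where the crawler name, role ARN, database, and S3 path are placeholders:

```python
import json

# Crawler Configuration JSON equivalent to the two console checkboxes:
#   Columns: MergeNewColumns      -> 'Add new columns only'
#   Partitions: InheritFromTable  -> 'Update all new and existing
#                                     partitions with metadata from the table'
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Columns": {"AddOrUpdateBehavior": "MergeNewColumns"},
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
    },
}


def create_crawler():
    """Create the crawler with the configuration above.
    All names and paths below are placeholders."""
    import boto3  # assumes boto3 is installed and AWS credentials are set

    glue = boto3.client("glue")
    glue.create_crawler(
        Name="my-csv-crawler",
        Role="arn:aws:iam::123456789012:role/MyGlueRole",
        DatabaseName="my_database",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/my-data/"}]},
        Configuration=json.dumps(crawler_configuration),
    )
```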

  2. Run the crawler to create the table. It will create the table with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".
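If you'd rather script this edit than use the console, something like the following should work. This is a sketch, not tested against a live account; `to_table_input` and `set_serde_to_opencsv` are helper names I made up:

```python
OPENCSV_SERDE = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

# update_table accepts only TableInput fields; get_table returns extra
# read-only attributes that must be stripped before the round trip.
READ_ONLY_KEYS = {
    "DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
    "IsRegisteredWithLakeFormation", "CatalogId", "VersionId",
}


def to_table_input(table: dict) -> dict:
    """Convert a get_table response into a valid TableInput,
    with the SerDe switched to OpenCSVSerde."""
    table_input = {k: v for k, v in table.items() if k not in READ_ONLY_KEYS}
    table_input["StorageDescriptor"]["SerdeInfo"]["SerializationLibrary"] = OPENCSV_SERDE
    return table_input


def set_serde_to_opencsv(database: str, table_name: str) -> None:
    """Apply the edit via the Glue API (assumes boto3 + AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database, TableInput=to_table_input(table))
```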

  3. Re-run the crawler.

In case a new partition is added on crawler re-run, it will also be created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".

You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not reset.
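For reference, the table's SerdeInfo after the edit might look like the sketch below. The Parameters are OpenCSVSerde's optional separatorChar/quoteChar/escapeChar settings; the values shown here are its defaults for comma-separated, double-quoted data:

```python
# A sketch of the SerdeInfo block the table ends up with (values shown
# are OpenCSVSerde's defaults; adjust them if your CSV uses a different
# delimiter or quoting style).
serde_info = {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "Parameters": {
        "separatorChar": ",",
        "quoteChar": '"',
        "escapeChar": "\\",
    },
}
```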

