Specify a SerDe serialization lib with AWS Glue Crawler
Question
Every time I run a Glue Crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields with embedded commas).
I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.
I've tried making my own CSV classifier, but that doesn't help.
How do I get the crawler to specify a particular serialization lib for the tables it produces or updates?
Answer
You can't currently specify the SerDe in the Glue Crawler, but here is a workaround:
- Create a Glue Crawler with the following configuration:
  - Enable 'Add new columns only': this adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.
  - Enable 'Update all new and existing partitions with metadata from the table': with this option, partitions inherit metadata properties such as classification, input format, output format, SerDe information, and schema from their parent table. Any changes to these properties on the table are propagated to its partitions.
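If you create the crawler programmatically rather than in the console, the two options above map onto the crawler's `Configuration` JSON ("MergeNewColumns" corresponds to 'Add new columns only', "InheritFromTable" to the partition option). A minimal sketch with boto3; the crawler name, IAM role, database, and S3 path are placeholders, not values from the question:

```python
import json

# Crawler Configuration JSON equivalent to the two console options:
#   Tables.AddOrUpdateBehavior = "MergeNewColumns"   -> 'Add new columns only'
#   Partitions.AddOrUpdateBehavior = "InheritFromTable"
#       -> 'Update all new and existing partitions with metadata from the table'
crawler_configuration = json.dumps({
    "Version": 1.0,
    "CrawlerOutput": {
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
    },
})

# Hypothetical names -- substitute your own crawler, role, database, and path:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(
#     Name="csv-crawler",
#     Role="GlueServiceRole",
#     DatabaseName="my_db",
#     Targets={"S3Targets": [{"Path": "s3://my-bucket/data/"}]},
#     Configuration=crawler_configuration,
# )

print(crawler_configuration)
```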
- Run the crawler to create the table. It will create the table with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".
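The manual edit can also be scripted. A minimal sketch of swapping the SerDe on a table definition shaped like boto3's `get_table` response; the database and table names are assumed for illustration, and note that read-only fields (e.g. `CreateTime`, `DatabaseName`) must be stripped from the definition before passing it back as `TableInput` to `update_table`:

```python
import copy

OPENCSV = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

def set_opencsv_serde(table):
    """Return a copy of a Glue table definition with its SerDe set to OpenCSVSerde."""
    updated = copy.deepcopy(table)
    updated["StorageDescriptor"]["SerdeInfo"]["SerializationLibrary"] = OPENCSV
    return updated

# Example fragment shaped like the Table dict boto3's get_table returns.
table = {
    "Name": "my_table",  # hypothetical table name
    "StorageDescriptor": {
        "SerdeInfo": {
            "SerializationLibrary":
                "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
        }
    },
}

print(set_opencsv_serde(table)["StorageDescriptor"]["SerdeInfo"]["SerializationLibrary"])

# Against the real catalog (hypothetical names; drop read-only keys first):
# import boto3
# glue = boto3.client("glue")
# current = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
# glue.update_table(DatabaseName="my_db", TableInput=set_opencsv_serde(current))
```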
- Re-run the crawler. If a new partition is added on the re-run, it will also be created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".
You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not reset.