AWS Glue Spark Job Fails to Support Upper case Column Name with Double Quotes


Problem description

Problem statement / root cause: We are using AWS Glue to load data from a production Postgres DB into our AWS data lake. Glue internally uses a Spark job to move the data. Our ETL process is failing, however, because Spark only supports lowercase table column names, and unfortunately all of our source Postgres table column names are in CamelCase and enclosed in double quotes.

E.g.: our source table column name in the Postgres DB is "CreatedDate". The Spark job query looks for createddate and fails because it cannot find that column. So the Spark job query needs to look for exactly "CreatedDate" to be able to move data out of the Postgres DB. This seems to be an inherent limitation of both Spark (it only supports lowercase table column names) and Postgres (column names that were created with double quotes have to be double-quoted for the rest of their life).
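For illustration, a minimal sketch of reading the quoted CamelCase column through Spark's plain JDBC reader, which can push a query down to Postgres with the identifier double-quoted so it is matched exactly (the connection details and the table name source_table are placeholders, not from the actual job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The subquery is executed by Postgres itself, so the double-quoted
# identifier "CreatedDate" is matched case-sensitively on the source side.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/<database>")
      .option("dbtable", '(SELECT "CreatedDate" FROM source_table) AS t')
      .option("user", "<user>")
      .option("password", "<password>")
      .load())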

Reference links: https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html and "Are PostgreSQL column names case-sensitive?"

Solutions evaluated:

  1. We will not be able to rename the column names from CamelCase to lowercase, as that would require a much bigger change across all downstream systems.
  2. We are trying to rewrite/tweak Glue's auto-generated Spark code to see whether we can get it to work with double-quoted, non-lowercase source table column names (a sketch of such a tweak follows below).
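A hedged sketch of what the tweak in option 2 could look like (column names and types here are illustrative, not from the actual job): Glue's auto-generated script maps columns with ApplyMapping, whose mappings are (source column, source type, target column, target type) tuples, so the source side of each tuple has to carry the exact CamelCase name:

from awsglue.transforms import ApplyMapping

# datasource0 is the DynamicFrame that the generated script reads from the
# Postgres source; the mapping keeps the CamelCase source name verbatim and
# renames it to a lowercase target column.
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[("CreatedDate", "timestamp", "created_date", "timestamp")],
    transformation_ctx="applymapping1",
)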

Has anyone run into this issue before, and have you tried tweaking the auto-generated Spark code to get it working?

Recommended answer

Sandeep Fatangare, thank you for your suggestion.

I am very new to AWS Glue, so I don't know whether I am doing this correctly. Please guide me if I'm wrong.

I tried it by navigating to AWS Glue -> Jobs and choosing the failed job's script.

The Details tab shows the script location mentioned in the job details: s3://aws-glue-assets-us-east-1/scripts/glueetl/jdbc_incremental.py.

And in the Script tab, I started editing the script.

Before:

applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=self.get_mappings(),
                                   transformation_ctx="applymapping1_" + self.source.table_name)

After (my edit; the original snippet used Scala-style syntax, caseSensitive : Boolean = false, which is not valid Python, so it is written here as a keyword argument; note that caseSensitive is a parameter of Glue's Scala applyMapping API, and it may not be accepted by the Python ApplyMapping.apply, which could be why this edit failed):

applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=self.get_mappings(),
                                   caseSensitive=False,
                                   transformation_ctx="applymapping1_" + self.source.table_name)

I faced two issues:

  1. I am not able to save the edited script (see the sketch below for a possible workaround).
  2. While running the script, it tells me that the workflow name is missing.
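
For the first issue, one hedged workaround (my suggestion, not from the original thread): since the job details show the script's S3 location, the script could be edited locally and copied back over that object instead of being saved through the console editor. A minimal boto3 sketch, assuming the bucket and key from the job details above:

import boto3

# Upload the locally edited script back over the job's script object in S3.
s3 = boto3.client("s3")
s3.upload_file(
    "jdbc_incremental.py",                  # locally edited copy of the script
    "aws-glue-assets-us-east-1",            # bucket taken from the job details
    "scripts/glueetl/jdbc_incremental.py",  # key taken from the job details
)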
