Tables not found in Spark SQL after migrating from EMR to AWS Glue


Problem description

I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata.

I create Hive external tables; they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL, e.g. spark.sql("select * from hive_table ...")

Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs do not use the Glue catalog for Spark SQL the same way Spark SQL running in EMR does.

I can work around this by using the Glue APIs and registering DataFrames as temp views:

create_dynamic_frame_from_catalog(...).toDF().createOrReplaceTempView(...)

but is there a way to do this automatically?
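For context, a fuller sketch of that workaround in a Glue job script might look like the following. The database name, table list, and query are hypothetical placeholders; the pattern is simply "load each catalog table through the Glue API, then expose it to Spark SQL under the name the EMR code expects":

```python
# Sketch of the temp-view workaround inside a Glue ETL job.
# "my_database" and the table list are hypothetical placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Register each catalog table as a temp view with the same name
# the EMR job's Spark SQL already references.
for table in ["hive_table"]:
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name=table,
    )
    dyf.toDF().createOrReplaceTempView(table)

# The original Spark SQL now resolves the table name.
result = spark.sql("select * from hive_table")
```

This only works per job run and per table, which is why an automatic catalog integration (below, in the answer) is preferable.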

Recommended answer

This was a much-awaited feature request (to use the Glue Data Catalog with Glue ETL jobs), which was released recently. When you create a new job, you'll find the following option:

Use Glue data catalog as the Hive metastore

You can also enable it for an existing job by editing the job and adding --enable-glue-datacatalog to the job parameters, providing no value.
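For example, the parameter can be set when creating a job from the AWS CLI; this is a configuration sketch, and the job name, role, and script location are hypothetical placeholders:

```shell
# Create a Glue ETL job with the Glue Data Catalog enabled as the
# Hive metastore. Name, role, and script path are placeholders.
aws glue create-job \
  --name my-etl-job \
  --role MyGlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/my_job.py \
  --default-arguments '{"--enable-glue-datacatalog": ""}'
```

Note that the flag takes no value, hence the empty string in --default-arguments.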

