Tables not found in Spark SQL after migrating from EMR to AWS Glue
Question
I have Spark jobs on EMR, and EMR is configured to use the Glue Data Catalog for Hive and Spark metadata.
I create Hive external tables, they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL, e.g. spark.sql("select * from hive_table ...").
Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs do not use the Glue catalog for Spark SQL the same way Spark SQL does when running on EMR.
I can work around this by using the Glue APIs and registering dataframes as temp views:
create_dynamic_frame_from_catalog(...).toDF().createOrReplaceTempView(...)
but is there a way to do this automatically?
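The workaround above can be sketched as a small helper for a Glue job script. The database name "mydb" and the table list are hypothetical placeholders; inside an actual Glue job, glue_context would be the job's GlueContext instance:

```python
# Sketch of the temp-view workaround, assuming a hypothetical Glue database
# "mydb" and a list of catalog table names to expose to Spark SQL.

def catalog_pairs(database, table_names):
    """Pure helper: pair each table name with its database."""
    return [(database, name) for name in table_names]

def register_temp_views(glue_context, database, table_names):
    """Load each catalog table via the Glue API and register it as a temp view.

    Only runnable inside the Glue job runtime, where awsglue is available.
    """
    for db, name in catalog_pairs(database, table_names):
        df = glue_context.create_dynamic_frame_from_catalog(
            database=db, table_name=name
        ).toDF()
        # After this, spark.sql("select * from <name> ...") resolves the table.
        df.createOrReplaceTempView(name)
```

A job would call register_temp_views(glueContext, "mydb", ["hive_table"]) once at startup, before any spark.sql() statements.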
Answer
This was a long-awaited feature request (using the Glue Data Catalog with Glue ETL jobs), and it has recently been released. When you create a new job, you'll find the following option:
Use Glue data catalog as the Hive metastore
You can also enable it for an existing job by editing the job and adding --enable-glue-datacatalog to the job parameters, providing no value.
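As a hedged sketch, the same flag can also be supplied when starting a job run from the AWS CLI (my-glue-job is a hypothetical job name; the flag takes no value, so an empty string is passed):

```shell
# Pass the special parameter per run; it takes no value (empty string).
aws glue start-job-run \
  --job-name my-glue-job \
  --arguments '{"--enable-glue-datacatalog": ""}'
```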