PySpark 2.1: Importing module with UDF's breaks Hive connectivity


Problem Description

I'm currently working with Spark 2.1 and have a main script that calls a helper module that contains all my transformation methods. In other words:

main.py
helper.py

At the top of my helper.py file I have several custom UDFs that I have defined in the following manner:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def reformat(s):
    return reformat_logic(s)  # reformat_logic is defined elsewhere in helper.py

reformat_udf = udf(reformat, StringType())  # runs at import time

Before I broke off all the UDFs into the helper file, I was able to connect to my Hive metastore through my SparkSession object using spark.sql('sql statement'). However, after I moved the UDFs to the helper file and imported that file at the top of my main script, the SparkSession object could no longer connect to Hive and went back to the default Derby database. I also get errors when trying to query my Hive tables, such as "Hive support is required to insert into the following tables..."

I've been able to solve my issue by moving my UDFs into a completely separate file and only running the import statements for that module inside the functions that need them (not sure if this is good practice, but it works). Anyway, does anyone understand why I'm seeing such peculiar behavior when it comes to Spark and UDFs? And does anyone know a good way to share UDFs across applications?
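
For concreteness, the deferred-import workaround described above looks roughly like this (a sketch only; the udfs.py file name, the function body, and the column names are hypothetical, not from the question):

# udfs.py -- the UDF definitions live in their own module
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def reformat(s):
    # placeholder body standing in for the asker's reformat_logic
    return s.strip().lower() if s is not None else None

reformat_udf = udf(reformat, StringType())

# helper.py -- import the UDF module lazily, inside the function that uses it,
# so nothing touches the SparkContext at import time
def add_reformatted_column(df):
    from udfs import reformat_udf  # deferred until a session already exists
    return df.withColumn('clean', reformat_udf(df['raw']))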

Recommended Answer

Prior to Spark 2.2.0, UserDefinedFunction eagerly creates a UserDefinedPythonFunction object, which represents the Python UDF on the JVM. This process requires access to the SparkContext and SparkSession. If there are no active instances when UserDefinedFunction.__init__ is called, Spark will automatically initialize the contexts for you.

When you call SparkSession.builder.getOrCreate after importing a UserDefinedFunction object, it returns the existing SparkSession instance, and only some configuration changes can be applied (enableHiveSupport is not among them).
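
Put together, the failure mode looks roughly like this (a minimal sketch against Spark 2.1; not code from the question or answer):

# On Spark < 2.2.0, this module-level call eagerly builds the JVM-side UDF,
# silently creating a SparkContext/SparkSession with default settings
# if none exists yet:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
broken_udf = udf(lambda s: s, StringType())

# Too late: getOrCreate() returns the session created above, and
# enableHiveSupport() is ignored, so the catalog falls back to Derby.
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()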

To address this problem, you should initialize the SparkSession before you import the UDFs:

from pyspark.sql.session import SparkSession

# Build the Hive-enabled session first...
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# ...and only then import the module whose top-level code defines UDFs
from helper import reformat_udf

This behavior is described in SPARK-19163 and fixed in Spark 2.2.0. Other API improvements include decorator syntax (SPARK-19160) and improved docstring handling (SPARK-19161).
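
For reference, the decorator syntax added by SPARK-19160 looks like this on Spark 2.2.0+ (a sketch; the function body is a placeholder):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def reformat(s):
    """Docstrings on the wrapped function are preserved (SPARK-19161)."""
    return s.strip().lower() if s is not None else None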
