PySpark: ModuleNotFoundError: No module named 'app'


Problem Description

I am saving a dataframe to a CSV file in PySpark using the statement below:

df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')

But I am getting the following error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
    return self.loads(obj)
  File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'

I am using PySpark version 2.3.0.

I am getting this error while trying to write to the file. Here is the relevant code:

    import json, jsonschema
    from pyspark.sql import functions
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType, FloatType
    from datetime import datetime
    import os

    # filter_data() loads and filters a source CSV; it is defined elsewhere in the class
    feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
    apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)

    df_all = feb.union(apr)
    df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])

    # create_emi_amount is a plain Python function defined elsewhere in the project
    create_emi_amount_udf = udf(create_emi_amount, FloatType())
    df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))

    df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')

Recommended Answer

The error is clear: the module 'app' does not exist on the executors. Your Python code runs on the driver, but your UDF runs in the executors' Python VMs. When you call the UDF, Spark serializes create_emi_amount and sends it to the executors.
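Here is a minimal sketch of that mechanism (the module layout and function body are hypothetical, for illustration only):

    # app/emi.py -- hypothetical module that exists on the driver but not on the executors
    def create_emi_amount(sanction_amount, loan_type):
        return float(sanction_amount) * 0.1  # placeholder logic

    # driver script
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType
    from app.emi import create_emi_amount  # pickled as a reference into module 'app'

    create_emi_amount_udf = udf(create_emi_amount, FloatType())
    # When the UDF executes, each worker unpickles the function and must
    # import 'app' -- raising ModuleNotFoundError if the module is absent there.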

So, somewhere in your method create_emi_amount you use or import the app module. A solution to your problem is to use the same environment on both the driver and the executors: in spark-env.sh, set the same Python virtualenv in PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=....
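For example, a minimal spark-env.sh sketch (the virtualenv path is an assumption; use whichever environment has the 'app' package installed):

    # spark-env.sh -- hypothetical path; 'app' must be importable from this environment
    export PYSPARK_DRIVER_PYTHON=/opt/venvs/myapp/bin/python
    export PYSPARK_PYTHON=/opt/venvs/myapp/bin/python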
