PySpark: An error occurred while calling o51.showString. No module named XXX


Question

I am using PySpark 2.2.0 and ran into a strange problem, which I have reduced to the following example. The file structure:

|root
|-- cast_to_float.py
|-- tests
    |-- test.py

In cast_to_float.py:

from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

def cast_to_float(y, column_name):
    return y.withColumn(column_name, y[column_name].cast(FloatType()))

def cast_to_float_1(y, column_name):
    to_float = udf(cast2float1, FloatType())
    return y.withColumn(column_name, to_float(column_name))

def cast2float1(a):
    return 1.0

In test.py:

from pyspark.sql import SparkSession
import os
import sys
parentPath = os.path.abspath('..')
if parentPath not in sys.path:
    sys.path.insert(0, parentPath)

from cast_to_float import *
spark = SparkSession.builder.appName("tests").getOrCreate()
df = spark.createDataFrame([
            (1, 1),
            (2, 2),
            (3, 3),
        ], ["ID", "VALUE"])
df1 = cast_to_float(df, 'ID')
df2 = cast_to_float_1(df, 'ID')

df1.show()
df1.printSchema()
df2.printSchema()
df2.show()

When I run the test from the tests folder, I get the following output, with an error raised by the last line (df2.show()):

+---+-----+
| ID|VALUE|
+---+-----+
|1.0|    1|
|2.0|    2|
|3.0|    3|
+---+-----+

root
 |-- ID: float (nullable = true)
 |-- VALUE: long (nullable = true)

root
 |-- ID: float (nullable = true)
 |-- VALUE: long (nullable = true)

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-86eb5df2f917> in <module>()
     19 df1.printSchema()
     20 df2.printSchema()
---> 21 df2.show()
...
Py4JJavaError: An error occurred while calling o257.showString.
...
ModuleNotFoundError: No module named 'cast_to_float'
...

cast_to_float does seem to be imported; otherwise I could not even get df1.

If I put test.py in the same directory as cast_to_float.py and run it from that directory, it works. Any ideas? Thanks!

I used the __file__ approach suggested by @user8371915 and found that it works if I run it from the root folder.

Recommended answer

As written, the result depends on the working directory from which you invoke the script.

If you run it from root, os.path.abspath('..') adds the parent of root rather than root itself. You should use a path relative to __file__ (see "What does the __file__ variable mean/do?"):

parentPath = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), 
    os.path.pardir
)
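To see why os.path.abspath('..') is fragile, the sketch below compares it with a path derived from the script's own location. The helper name parent_of and the temporary directories are made up for illustration; in test.py you would pass __file__ itself.

```python
import os
import tempfile


def parent_of(script_path):
    """Return the directory one level above the directory containing script_path."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.normpath(os.path.join(script_dir, os.path.pardir))


# Build a throwaway layout that mirrors the question: root/tests/test.py
root = os.path.realpath(tempfile.mkdtemp())
tests_dir = os.path.join(root, "tests")
os.makedirs(tests_dir)
script = os.path.join(tests_dir, "test.py")
open(script, "w").close()

original_cwd = os.getcwd()
results = {}
try:
    for cwd in (root, tests_dir):
        os.chdir(cwd)
        results[cwd] = (
            os.path.abspath(".."),  # depends on where the script is launched
            parent_of(script),      # depends only on where the script lives
        )
finally:
    os.chdir(original_cwd)

for cwd, (cwd_based, file_based) in results.items():
    print(cwd, "->", cwd_based, "|", file_based)
```

The cwd-based value changes with the launch directory, while the file-based value always points at the folder that holds cwd-independent cast_to_float.py in the question's layout.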

That said, I would recommend using a proper package structure.

Note:

This covers only the driver path in local mode. Even in local mode, worker paths are not affected by the driver path.

To handle executor paths (after the change above you will get executor exceptions), you still need to distribute the modules to the workers (see "How to use custom classes with Apache Spark (pyspark)?"):

spark = SparkSession.builder.appName("tests").getOrCreate()
spark.sparkContext.addPyFile("/path/to/cast_to_float.py")
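If the helper code grows from a single file into a package, addPyFile also accepts a .zip archive. The following is a minimal sketch of building such an archive with the standard library; the package name mylib, the module casts.py, and the helper zip_package are hypothetical and not part of the original question. The Spark call is left as a comment because it needs a running session.

```python
import os
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Zip package_dir so that the package folder itself is the top-level entry."""
    package_dir = os.path.abspath(package_dir)
    base = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(folder, name)
                    # Store paths relative to the parent so imports read "mylib.casts"
                    archive.write(full, os.path.relpath(full, base))
    return zip_path


# Hypothetical package "mylib" created in a temporary directory for illustration
work = tempfile.mkdtemp()
pkg = os.path.join(work, "mylib")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "casts.py"), "w") as f:
    f.write("def cast2float1(a):\n    return 1.0\n")

archive_path = zip_package(pkg, os.path.join(work, "mylib.zip"))
with zipfile.ZipFile(archive_path) as zf:
    names = sorted(zf.namelist())
print(names)

# With a live session (not executed here):
# spark.sparkContext.addPyFile(archive_path)
```

Keeping the package folder as the top-level entry in the archive means the workers can import it under the same name the driver uses.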
