PySpark: An error occurred while calling o51.showString. No module named XXX

Question

My PySpark version is 2.2.0. I ran into a strange problem, which I have simplified as follows. The file structure:

|root
|-- cast_to_float.py
|-- tests
    |-- test.py

In cast_to_float.py, my code is:

from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

def cast_to_float(y, column_name):
    return y.withColumn(column_name, y[column_name].cast(FloatType()))

def cast_to_float_1(y, column_name):
    to_float = udf(cast2float1, FloatType())
    return y.withColumn(column_name, to_float(column_name))

def cast2float1(a):
    return 1.0

In test.py:

from pyspark.sql import SparkSession
import os
import sys
parentPath = os.path.abspath('..')
if parentPath not in sys.path:
    sys.path.insert(0, parentPath)

from cast_to_float import *
spark = SparkSession.builder.appName("tests").getOrCreate()
df = spark.createDataFrame([
            (1, 1),
            (2, 2),
            (3, 3),
        ], ["ID", "VALUE"])
df1 = cast_to_float(df, 'ID')
df2 = cast_to_float_1(df, 'ID')

df1.show()
df1.printSchema()
df2.printSchema()
df2.show()

Then I run the test from the tests folder and get an error message from the last line (df2.show()):

+---+-----+
| ID|VALUE|
+---+-----+
|1.0|    1|
|2.0|    2|
|3.0|    3|
+---+-----+

root
 |-- ID: float (nullable = true)
 |-- VALUE: long (nullable = true)

root
 |-- ID: float (nullable = true)
 |-- VALUE: long (nullable = true)

    Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-86eb5df2f917> in <module>()
     19 df1.printSchema()
     20 df2.printSchema()
---> 21 df2.show()
...
Py4JJavaError: An error occurred while calling o257.showString.
...
ModuleNotFoundError: No module named 'cast_to_float'
...

It seems cast_to_float is imported; otherwise I could not even get df1.

If I put test.py in the same directory as cast_to_float.py and run it from that directory, everything works. Any ideas? Thanks!

I used @user8371915's __file__ method and found that it works if I run the script from the root folder.

Answer

As it stands, the result depends on the working directory from which you invoke the script.

If you run from root, os.path.abspath('..') adds root's parent, not root itself. You should use a path relative to __file__ instead (see What does the __file__ variable mean/do?):

parentPath = os.path.join(
    os.path.abspath(os.path.dirname(__file__)), 
    os.path.pardir
)
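
For example, the top of test.py would then become (a minimal sketch combining the snippet above with the question's original import guard):

import os
import sys

# Resolve the project root relative to this file, not the current working directory
parentPath = os.path.abspath(os.path.join(
    os.path.dirname(__file__),
    os.path.pardir
))
if parentPath not in sys.path:
    sys.path.insert(0, parentPath)

from cast_to_float import *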

but I would recommend using a proper package structure instead.
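
For instance, a minimal layout could look like this (the package name mypkg and the use of setup.py are illustrative assumptions, not part of the original post):

|root
|-- setup.py
|-- mypkg
|   |-- __init__.py
|   |-- cast_to_float.py
|-- tests
    |-- test.py

After pip install -e ., the module can be imported as mypkg.cast_to_float from any working directory, with no sys.path manipulation.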

Note:

This covers only local mode and the driver path; even in local mode, worker paths are not affected by the driver path.

To handle executor paths (after the change above you will get executor exceptions instead), you should still distribute the module to the workers, as described in How to use custom classes with Apache Spark (pyspark)?:

spark = SparkSession.builder.appName("tests").getOrCreate()
spark.sparkContext.addPyFile("/path/to/cast_to_float.py")
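
addPyFile ships cast_to_float.py to every executor and adds it to the workers' Python path, so the pickled UDF can resolve cast2float1 when it runs on a worker. To avoid hard-coding the path, the call can also be made relative to __file__ (a sketch assuming the file structure from the question):

import os

spark.sparkContext.addPyFile(os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    os.path.pardir,
    "cast_to_float.py"
))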
