Pyspark 'NoneType' object has no attribute '_jvm' error
Problem description
I was trying to print the total number of elements in each partition of a DataFrame using Spark 2.2:
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)
    yield (splitIndex, n)
spark = SparkSession.builder.appName("tmp").getOrCreate()
num_parts = 3
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(num_parts)
print("df has partitions."+ str(df.rdd.getNumPartitions()))
print("Elements across partitions is:" + str(df.rdd.mapPartitionsWithIndex(lambda ind, x: count_elements(ind, x)).take(3)))
The code above kept failing with the following error:
n = sum(1 for _ in iterator)
  File "/home/dev/wk/pyenv/py3/lib/python3.5/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 40, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'
After removing the import below
from pyspark.sql.functions import *
the code works fine:
skewed_large_df has partitions.3
The distribution of elements across partitions is:[(0, 1), (1, 2), (2, 2)]
What is causing this error and how can I fix it?
This is a great example of why you shouldn't use import *.
The line
from pyspark.sql.functions import *
brings all of the functions in the pyspark.sql.functions module into your namespace, including some that will shadow your builtins.
The specific issue is in the count_elements function, on the line:
n = sum(1 for _ in iterator)
# ^^^ - this is now pyspark.sql.functions.sum
You intended to call __builtin__.sum, but the import * shadowed the builtin.
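You can see the shadowing directly; this is a minimal sketch, assuming PySpark 2.x is installed and it is run in a plain Python shell with no active SparkContext:

print(sum(range(5)))                 # 10 -- the builtin sum
from pyspark.sql.functions import *  # rebinds the name "sum"
print(sum)                           # now the Spark SQL sum function
sum(1 for _ in range(5))             # tries to build a JVM column expression and fails:
                                     # AttributeError: 'NoneType' object has no attribute '_jvm'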
Instead, do one of the following:
import pyspark.sql.functions as f
Or
from pyspark.sql.functions import sum as sum_
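With the first style, a sketch of the question's script (reusing the JSON path and SparkSession setup from the question) would look roughly like this; the builtin sum is no longer shadowed, and Spark's own sum stays reachable through the f alias (or as sum_ with the second import style):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)      # builtin sum -- no longer shadowed
    yield (splitIndex, n)

spark = SparkSession.builder.appName("tmp").getOrCreate()
df = spark.read.json("/tmp/tmp/gon_s.json").repartition(3)
print(df.rdd.mapPartitionsWithIndex(count_elements).take(3))

# Spark's sum is still available, explicitly namespaced:
df.agg(f.sum(f.lit(1)).alias("row_count")).show()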