pyspark 'DataFrame' object has no attribute '_get_object_id'
Question
I am trying to run some code, but I am getting this error:

'DataFrame' object has no attribute '_get_object_id'
Code:
items = [(1, 12), (1, float('Nan')), (1, 14), (1, 10), (2, 22), (2, 20),
         (2, float('Nan')), (3, 300), (3, float('Nan'))]
sc = spark.sparkContext
rdd = sc.parallelize(items)
df = rdd.toDF(["id", "col1"])

import pyspark.sql.functions as func

means = df.groupby("id").agg(func.mean("col1"))

# The error is thrown at this line
df = df.withColumn(
    "col1",
    func.when(
        df["col1"].isNull(),
        means.where(func.col("id") == df["id"])  # a DataFrame, not a Column
    ).otherwise(func.col("col1"))
)
Answer
You can't reference a second Spark DataFrame inside a function unless you're using a join. Here, func.when() expects a Column, but means.where(...) is a DataFrame; the error surfaces when py4j tries to pass it to the JVM and looks for _get_object_id, which DataFrame does not have. IIUC, you can do the following to achieve your desired result.
Assume means is the following:
#means.show()
#+---+---------+
#| id|avg(col1)|
#+---+---------+
#| 1| 12.0|
#| 3| 300.0|
#| 2| 21.0|
#+---+---------+
Join df and means on the id column, then apply your when condition:
from pyspark.sql.functions import when

df.join(means, on="id")\
    .withColumn(
        "col1",
        when(
            df["col1"].isNull(),
            means["avg(col1)"]
        ).otherwise(df["col1"])
    )\
    .select(*df.columns)\
    .show()
#+---+-----+
#| id| col1|
#+---+-----+
#| 1| 12.0|
#| 1| 12.0|
#| 1| 14.0|
#| 1| 10.0|
#| 3|300.0|
#| 3|300.0|
#| 2| 21.0|
#| 2| 22.0|
#| 2| 20.0|
#+---+-----+
But in this case, I'd actually recommend using a Window with pyspark.sql.functions.mean:
from pyspark.sql import Window
from pyspark.sql.functions import col, mean, when

df.withColumn(
    "col1",
    when(
        col("col1").isNull(),
        mean("col1").over(Window.partitionBy("id"))
    ).otherwise(col("col1"))
).show()
#+---+-----+
#| id| col1|
#+---+-----+
#| 1| 12.0|
#| 1| 10.0|
#| 1| 12.0|
#| 1| 14.0|
#| 3|300.0|
#| 3|300.0|
#| 2| 22.0|
#| 2| 20.0|
#| 2| 21.0|
#+---+-----+