PySpark: Passing a DataFrame to a pandas_udf and returning a Series
Question
I'm using PySpark's new pandas_udf decorator, and I'm trying to get it to take multiple columns as input and return a Series as output. However, I get a TypeError: Invalid argument.
Example code
@pandas_udf(df.schema, PandasUDFType.SCALAR)
def fun_function(df_in):
    df_in.loc[df_in['a'] < 0] = 0.0
    return (df_in['a'] - df_in['b']) / df_in['c']
Recommended answer
A SCALAR pandas_udf expects pandas Series as input, not a DataFrame. In your case there is no need for a UDF at all: computing directly from columns a, b, and c after clipping should work:
import pyspark.sql.functions as f

df = spark.createDataFrame([[1,2,4],[-1,2,2]], ['a', 'b', 'c'])
# Zero out a value wherever a < 0, mirroring the row-wise assignment in the question
clip = lambda x: f.when(df.a < 0, 0).otherwise(x)
df.withColumn('d', (clip(df.a) - clip(df.b)) / clip(df.c)).show()
#+---+---+---+-----+
#|  a|  b|  c|    d|
#+---+---+---+-----+
#|  1|  2|  4|-0.25|
#| -1|  2|  2| null|
#+---+---+---+-----+
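The clip helper tests a for every column because the question's df_in.loc[df_in['a'] < 0] = 0.0 zeroes entire rows, not just column a. A minimal pandas-only sketch of that behavior, using the sample values from the answer (stored as floats here, which is an assumption made only to keep dtypes simple):

```python
import pandas as pd

# Same sample rows as the answer's Spark DataFrame, as floats (assumption).
df_in = pd.DataFrame({"a": [1.0, -1.0], "b": [2.0, 2.0], "c": [4.0, 2.0]})

# A boolean row mask with no column selector assigns to every column of the matching rows.
df_in.loc[df_in["a"] < 0] = 0.0

# The second row is now all zeros, so the division there is 0/0, which pandas reports as NaN.
d = (df_in["a"] - df_in["b"]) / df_in["c"]
print(df_in)
print(d)
```

This is why the second row of the result is null rather than a number.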
If you do have to use a pandas_udf, the return type needs to be double, not df.schema, because you return only a pandas Series, not a pandas DataFrame. You also need to pass the columns into the function as Series, not the whole DataFrame:
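from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def fun_function(a, b, c):
    # a, b, c arrive as pandas Series; zero every value in rows where a < 0
    clip = lambda x: x.where(a >= 0, 0)
    return (clip(a) - clip(b)) / clip(c)

df.withColumn('d', fun_function(df.a, df.b, df.c)).show()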
#+---+---+---+-----+
#|  a|  b|  c|    d|
#+---+---+---+-----+
#|  1|  2|  4|-0.25|
#| -1|  2|  2| null|
#+---+---+---+-----+
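Since the body of a SCALAR pandas_udf is ordinary pandas code operating on Series, one way to sanity-check it is to call the same logic on plain pandas Series without a Spark session. A small sketch, where clip_and_combine is a hypothetical name for the undecorated function body:

```python
import pandas as pd

def clip_and_combine(a: pd.Series, b: pd.Series, c: pd.Series) -> pd.Series:
    # Keep values in rows where a >= 0; replace them with 0 elsewhere.
    clip = lambda x: x.where(a >= 0, 0)
    return (clip(a) - clip(b)) / clip(c)

# The same two rows as the Spark example, one Series per column.
a = pd.Series([1, -1])
b = pd.Series([2, 2])
c = pd.Series([4, 2])

result = clip_and_combine(a, b, c)
print(result)
```

The second element comes out as NaN in pandas (0 divided by 0), which corresponds to the null shown in the Spark output above.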