PySpark row-wise function composition
Question
As a simplified example, I have a dataframe "df" with columns "col1,col2", and I want to compute a row-wise maximum after applying a function to each column:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def f(x):
    return x + 1

max_udf = udf(lambda x, y: max(x, y), IntegerType())
f_udf = udf(f, IntegerType())
df2 = df.withColumn("result", max_udf(f_udf(df.col1), f_udf(df.col2)))
So if df is:
col1 col2
1 2
3 0
then
df2:
col1 col2 result
1 2 3
3 0 4
The above doesn't seem to work and produces the error "Cannot evaluate expression: PythonUDF#f...".
I'm absolutely positive that "f_udf" works just fine on my table; the main issue is with max_udf.
Without creating extra columns or using basic map/reduce, is there a way to do the above entirely with dataframes and udfs? How should I modify "max_udf"?
I've also tried:
max_udf = udf(max, IntegerType())
which produces the same error.
I've also confirmed that the following works:
df2 = (df.withColumn("temp1", f_udf(df.col1))
         .withColumn("temp2", f_udf(df.col2)))

df2 = df2.withColumn("result", max_udf(df2.temp1, df2.temp2))
Why can't I do this in one go?
I'd like to see an answer that generalizes to any functions "f_udf" and "max_udf".
Answer
I had a similar problem and found the solution in the answer to this Stack Overflow question.

To pass multiple columns or a whole row to a UDF, use a struct:
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))

# The UDF receives the whole row as a single struct argument.
count_empty_columns = udf(lambda row: len([x for x in row if x is None]), IntegerType())

new_df = df.withColumn("null_count", count_empty_columns(struct([df[x] for x in df.columns])))

new_df.show()
which returns:
+----+----+----------+
| a| b|null_count|
+----+----+----------+
|null|null| 2|
| 1|null| 1|
|null| 2| 1|
+----+----+----------+
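The same struct trick can be applied to the original row-wise maximum question. Below is a minimal sketch (not part of the original answer), assuming f(x) = x + 1, that df contains only the columns to combine, and that none of the values are null; the name max_f_udf is illustrative:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

def f(x):
    return x + 1

# Receive the whole row as one struct argument, apply f to every field,
# and return the row-wise maximum.
max_f_udf = udf(lambda row: max(f(x) for x in row), IntegerType())

df2 = df.withColumn("result", max_f_udf(struct([df[c] for c in df.columns])))

With the example data above, this yields result values 3 and 4, matching the expected df2.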