将udf应用于多个列并使用numpy操作 [英] apply udf to multiple columns and use numpy operations
问题描述
我在pyspark中有一个名为result的数据框,我想应用udf来创建一个新列,如下所示:
I have a dataframe named result in pyspark and I want to apply a udf to create a new column as below:
result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)]).withColumnRenamed("_1","count").withColumnRenamed("_2","df").withColumnRenamed("_3","docs")
@udf("float")
def newFunction(arr):
return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
result=result.withColumn("new_function_result",newFunction_udf(array("count","df","docs")))
列数,df,docs均为整数列.但这将返回
the column count,df,docs all are integer columns.but this returns
Py4JError:调用时发生错误 z:org.apache.spark.sql.functions.col.跟踪:py4j.Py4JException: 方法col([class java.util.ArrayList])在以下位置不存在 py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) 在 py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339) 在py4j.Gateway.invoke(Gateway.java:274)处 py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) 在py4j.commands.CallCommand.execute(CallCommand.java:79)处 py4j.GatewayConnection.run(GatewayConnection.java:214)位于 java.lang.Thread.run(Thread.java:748)
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace: py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748)
当我尝试通过一列并获得其平方时,它工作正常.
When I try passing one column and getting squares of those it works fine.
感谢您的帮助.
推荐答案
该错误消息具有误导性,但试图告诉您您的函数未返回浮点数.您的函数返回类型为numpy.float64
的值,您可以使用VectorUDT类型(在以下示例中为函数:newFunctionVector
)来获取该值.使用numpy的另一种方法是将numpy类型numpy.float64
强制转换为python类型float(在以下示例中为Function:newFunctionWithArray
).
The error message is misleading, but is trying to tell you that your function doesn't return a float. Your function returns value of type numpy.float64
which you can fetch with the VectorUDT type (Function: newFunctionVector
in the example below). Another way to make use of numpy is by casting the numpy type numpy.float64
to the python type float (Function: newFunctionWithArray
in the example below).
Last but not least, it is not necessary to call array as udfs can use more than one parameter (Function: newFunction
in the example below).
import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import Vectors, VectorUDT
result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], ["count","df","docs"])
def newFunctionVector(arr):
return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
@udf("float")
def newFunctionWithArray(arr):
returnValue = (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
return returnValue.item()
@udf("float")
def newFunction(count, df, docs):
returnValue = (1 + np.log(count)) * np.log(docs/df)
return returnValue.item()
vector_udf = udf(newFunctionVector, VectorUDT())
result=result.withColumn("new_function_result", newFunction("count","df","docs"))
result=result.withColumn("new_function_result_WithArray", newFunctionWithArray(array("count","df","docs")))
result=result.withColumn("new_function_result_Vector", newFunctionWithArray(array("count","df","docs")))
result.printSchema()
result.show()
输出:
root
|-- count: long (nullable = true)
|-- df: long (nullable = true)
|-- docs: long (nullable = true)
|-- new_function_result: float (nullable = true)
|-- new_function_result_WithArray: float (nullable = true)
|-- new_function_result_Vector: float (nullable = true)
+-----+---+----+-------------------+-----------------------------+--------------------------+
|count| df|docs|new_function_result|new_function_result_WithArray|new_function_result_Vector|
+-----+---+----+-------------------+-----------------------------+--------------------------+
| 138| 5| 10| 4.108459| 4.108459| 4.108459|
| 128| 4| 10| 5.362161| 5.362161| 5.362161|
| 112| 3| 10| 6.8849173| 6.8849173| 6.8849173|
| 120| 3| 10| 6.967983| 6.967983| 6.967983|
| 189| 1| 10| 14.372153| 14.372153| 14.372153|
+-----+---+----+-------------------+-----------------------------+--------------------------+
这篇关于将udf应用于多个列并使用numpy操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!