Pyspark - create new column from operations of DataFrame columns gives error "Column is not iterable"
Problem Description
I have a PySpark DataFrame and I have tried many examples showing how to create a new column based on operations with existing columns, but none of them seem to work.
So I have a couple of questions:
1- Why doesn't this code work?
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
sc = SparkContext()
sqlContext = SQLContext(sc)
a = sqlContext.createDataFrame([(5, 5, 3)], ['A', 'B', 'C'])
a.withColumn('my_sum', F.sum(a[col] for col in a.columns)).show()
I get the error:
TypeError: Column is not iterable
I found out how to make this work. I have to use the native Python sum function instead of F.sum: a.withColumn('my_sum', sum(a[col] for col in a.columns)).show(). It works, but I have no idea why.
2- If there is a way to make this sum work, how can I write a udf function to do this (and add the result to a new column of a DataFrame)?
import numpy as np

def my_dif(row):
    d = np.diff(row)  # creates an array of element-by-element differences
    return d.mean()   # returns the mean of the array
I am using Python 3.6.1 and Spark 2.1.1.
Thanks!
Recommended Answer
from pyspark.sql.types import IntegerType

a = sqlContext.createDataFrame([(5, 5, 3)], ['A', 'B', 'C'])
a = a.withColumn('my_sum', F.UserDefinedFunction(lambda *args: sum(args), IntegerType())(*a.columns))
a.show()
+---+---+---+------+
| A| B| C|my_sum|
+---+---+---+------+
| 5| 5| 3| 13|
+---+---+---+------+