How to loop through each row of a DataFrame in PySpark
Problem description
For example:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is an existing SparkContext
sample = sqlContext.sql("select Name, age, city from user")
sample.show()
The statements above print the entire table to the terminal, but I want to access each row of that table with a for or while loop to perform further calculations.
Solution
You can define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function is then applied to every row of the DataFrame. Note that sample2 will be an RDD, not a DataFrame.
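As a rough end-to-end sketch of the map approach, the snippet below applies a per-row function and then loops over the results on the driver. It assumes a local Spark installation and uses SparkSession rather than the SQLContext from the question. The sample rows, the run_with_spark helper and the FakeRow stand-in are hypothetical and only there for illustration.

```python
from collections import namedtuple


def custom_function(row):
    # Works on any object exposing name, age and city attributes,
    # which includes pyspark.sql.Row.
    return (row.name, row.age, row.city)


def run_with_spark():
    # Hypothetical helper; pyspark is imported lazily so the per-row
    # function above can be tried without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("row-loop-sketch")
        .getOrCreate()
    )
    # Made-up rows standing in for the "user" table from the question.
    sample = spark.createDataFrame(
        [("Alice", 30, "Paris"), ("Bob", 25, "Oslo")],
        ["name", "age", "city"],
    )
    sample2 = sample.rdd.map(custom_function)  # lazy: nothing has run yet
    # collect() pulls every element to the driver, so this loop is only
    # reasonable when the result is small.
    for name, age, city in sample2.collect():
        print(name, age, city)
    spark.stop()


# Plain-Python check of the per-row logic, using a namedtuple as a fake row.
FakeRow = namedtuple("FakeRow", ["name", "age", "city"])
result = custom_function(FakeRow("Alice", 30, "Paris"))
```

Calling run_with_spark() runs the same function through Spark. For results too large to collect at once, DataFrame.toLocalIterator() may be an alternative, since it returns rows to the driver one partition at a time.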
Map is needed if you are going to perform more complex computations. If you just need to add a derived column, you can use withColumn, which returns a DataFrame.
sample3 = sample.withColumn('age2', sample.age + 2)