How to loop through each row of a DataFrame in PySpark

Views: 5604 Posted: 2018/1/27 23:12:37 Tags: python-3.x for-loop apache-spark pyspark


Problem description


For example:

    from pyspark.sql import SQLContext

    # sc is an existing SparkContext
    sqlContext = SQLContext(sc)
    sample = sqlContext.sql("select Name, age, city from user")
    sample.show()

The statements above print the entire table to the terminal, but I want to access each row of that table with a for or while loop to perform further calculations.
Solution
You can define a custom function and apply it with map:

    def customFunction(row):
        return (row.name, row.age, row.city)

    sample2 = sample.rdd.map(customFunction)

or, equivalently, with a lambda:

    sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))

The custom function is then applied to every row of the DataFrame. Note that sample2 will be an RDD, not a DataFrame.

map is needed if you are going to perform more complex computations. If you only need to add a derived column, you can use withColumn, which returns a DataFrame:

    sample3 = sample.withColumn('age2', sample.age + 2)
