how to loop through each row of dataFrame in pyspark
Question
For example:
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints the entire table on the terminal, but I want to access each row in that table using a for or while loop to perform further calculations.
Recommended answer
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)
sample2 = sample.rdd.map(customFunction)
Or:
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)