How can I loop through a Spark data frame
Question
How can I loop through a Spark data frame? I have a data frame that consists of:
time, id, direction
10, 4, True //here 4 enters --> (4,)
20, 5, True //here 5 enters --> (4,5)
34, 5, False //here 5 leaves --> (4,)
67, 6, True //here 6 enters --> (4,6)
78, 6, False //here 6 leaves --> (4,)
99, 4, False //here 4 leaves --> ()
it is sorted by time and now I would like to step through and accumulate the valid ids. The ids enter on direction==True and exit on direction==False
so the resulting RDD should look like this
time, valid_ids
(10, (4,))
(20, (4,5))
(34, (4,))
(67, (4,6))
(78, (4,))
(99, ())
I know that this will not parallelize, but the df is not that big. So how could this be done in Spark/Scala?
Answer
If data is small ("but the df is not that big") I'd just collect and process using Scala collections. If types are as shown below:
df.printSchema
root
|-- time: integer (nullable = false)
|-- id: integer (nullable = false)
|-- direction: boolean (nullable = false)
you can collect:
val data = df.as[(Int, Int, Boolean)].collect.toSeq
and scanLeft:
val result = data.scanLeft((-1, Set[Int]())) {
  case ((_, acc), (time, value, true))  => (time, acc + value)  // direction==true: id enters
  case ((_, acc), (time, value, false)) => (time, acc - value)  // direction==false: id leaves
}.tail  // drop the (-1, Set()) seed element
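Since the collected data is just a plain Scala `Seq`, the accumulation step can be checked without a Spark session at all. Below is a minimal, Spark-free sketch: the hard-coded `data` stands in for the rows collected from the DataFrame, using the sample values from the question.

```scala
// Rows as (time, id, direction), standing in for df.as[(Int, Int, Boolean)].collect.toSeq
val data = Seq(
  (10, 4, true),
  (20, 5, true),
  (34, 5, false),
  (67, 6, true),
  (78, 6, false),
  (99, 4, false)
)

// scanLeft threads the accumulator (lastTime, activeIds) through the rows,
// emitting one intermediate state per row; .tail drops the initial seed.
val result = data.scanLeft((-1, Set[Int]())) {
  case ((_, acc), (time, id, true))  => (time, acc + id)  // id enters
  case ((_, acc), (time, id, false)) => (time, acc - id)  // id leaves
}.tail

// result: Seq((10,Set(4)), (20,Set(4,5)), (34,Set(4)), (67,Set(4,6)), (78,Set(4)), (99,Set()))
```

Because the accumulator uses an immutable `Set`, each step produces a fresh `(time, activeIds)` pair rather than mutating shared state, which is what lets `scanLeft` expose every intermediate snapshot.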