Aggregate a Spark DataFrame based on and before date

Views: 20 Published: 2021/11/14 23:28:57 Tags: scala apache-spark apache-spark-sql

Question

I have a DataFrame with a start_date column of date type. For each distinct start_date, I need to count the unique values in column1 whose start_date is on or before that date. This is the input DataFrame:

column1   column2  start_date
id1       val1     2018-03-12
id1       val2     2018-03-12
id2       val3     2018-03-12 
id3       val4     2018-03-12
id4       val5     2018-03-11
id4       val6     2018-03-11
id5       val7     2018-03-11
id5       val8     2018-03-11 
id6       val9     2018-03-10

I need to convert it into the following:

start_date     count
2018-03-12    6
2018-03-11    3
2018-03-10    1

This is what I am doing now, which is not efficient:

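Find all distinct start dates and store them in a list.
Iterate over the list and generate the output for each start_date.
Combine all outputs into a single DataFrame.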

Is there a better way to do this without looping?

Recommended answer

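You can combine a standard aggregation with a window function, but the second stage won't be distributed: a window with no partitionBy moves all rows into a single partition. That is usually acceptable here, because after the groupBy there is only one row per start_date.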

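import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
// The $"..." column syntax needs import spark.implicits._ (already in scope in spark-shell).

df
 // Stage 1: count the distinct column1 values within each start_date.
 // approx_count_distinct is an estimate; use countDistinct for exact counts.
 .groupBy($"start_date")
 .agg(approx_count_distinct($"column1").alias("count"))
 // Stage 2: running total over dates in ascending order.
 // No partitionBy, so this step runs in a single partition.
 .withColumn(
   "cumulative_count", sum($"count").over(Window.orderBy($"start_date")))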

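The snippet above sums per-date distinct counts, so it matches the expected output only when each column1 value appears under a single start_date, as it does in the sample data. As a minimal sketch for the more general case, assuming an id should be counted from its earliest start_date onward, the version below first reduces each column1 value to its first date and then takes the running total. The object name CumulativeDistinct, the helpers sampleData and cumulativeDistinct, the intermediate column new_ids, and the local SparkSession settings are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical wrapper object; all names here are for illustration only.
object CumulativeDistinct {

  // Rebuilds the sample input from the question, with start_date as a date column.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("id1", "val1", "2018-03-12"),
      ("id1", "val2", "2018-03-12"),
      ("id2", "val3", "2018-03-12"),
      ("id3", "val4", "2018-03-12"),
      ("id4", "val5", "2018-03-11"),
      ("id4", "val6", "2018-03-11"),
      ("id5", "val7", "2018-03-11"),
      ("id5", "val8", "2018-03-11"),
      ("id6", "val9", "2018-03-10")
    ).toDF("column1", "column2", "start_date")
      .withColumn("start_date", to_date(col("start_date")))
  }

  def cumulativeDistinct(df: DataFrame): DataFrame = {
    // Assumption: an id counts from its earliest start_date onward,
    // so keep one row per column1 value with its first date.
    val firstSeen = df
      .groupBy(col("column1"))
      .agg(min(col("start_date")).alias("start_date"))

    // Running total over ascending dates. There is no partitionBy,
    // so this window is evaluated in a single partition.
    val upToCurrentDate = Window
      .orderBy(col("start_date"))
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    firstSeen
      .groupBy(col("start_date"))
      .agg(count(col("column1")).alias("new_ids")) // ids first seen on this date
      .withColumn("count", sum(col("new_ids")).over(upToCurrentDate))
      .select(col("start_date"), col("count"))
      .orderBy(col("start_date").desc)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cumulative-distinct")
      .getOrCreate()
    cumulativeDistinct(sampleData(spark)).show()
    spark.stop()
  }
}
```

On the sample data from the question this should give 6, 3 and 1 for 2018-03-12, 2018-03-11 and 2018-03-10. Both groupBy steps are distributed; only the final window, which sees one row per date, runs in a single partition.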