Spark: need confirmation on approach in capturing first and last date on a dataset


Problem description

I have a dataframe:

A, B, C, D, 201701, 2020001
A, B, C, D, 201801, 2020002
A, B, C, D, 201901, 2020003

Expected output:

col_A, col_B, col_C, col_D, min_week, max_week, min_month, max_month
A,     B,     C,     D,     201701,   201901,   2020001,   2020003
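
For reference, here is a minimal sketch (an assumption, not part of the original question) of how such an input could be built; the WEEK and MONTH column names are taken from the code below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical construction of the sample input shown above.
df_source = spark.createDataFrame(
    [('A', 'B', 'C', 'D', 201701, 2020001),
     ('A', 'B', 'C', 'D', 201801, 2020002),
     ('A', 'B', 'C', 'D', 201901, 2020003)],
    ['A', 'B', 'C', 'D', 'WEEK', 'MONTH'])
df_source.createOrReplaceTempView('df_source')  # used by the spark.sql attempts below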

What I tried in pyspark:

from pyspark.sql import Window
import pyspark.sql.functions as psf

w1 = Window.partitionBy('A','B', 'C', 'D')\
.orderBy('WEEK','MONTH')
df_new = df_source\
.withColumn("min_week", psf.first("WEEK").over(w1))\
.withColumn("max_week", psf.last("WEEK").over(w1))\
.withColumn("min_month", psf.first("MONTH").over(w1))\
.withColumn("max_month", psf.last("MONTH").over(w1))

I also tried:

sql_1 = """
select A, B , C, D, first(WEEK) as min_week, 
last(WEEK) as max_week , first(MONTH) as min_month, 
last(MONTH) as max_month from df_source
group by A, B , C, D
order by A, B , C, D
"""
df_new = spark.sql(sql_1)

Using the first and the second approach I got inconsistent results. Will the approach below fix the issue encountered above?

sql_1 = """
select A, B , C, D, min(WEEK) as min_week, 
max(WEEK) as max_week , min(MONTH) as min_month, 
max(MONTH) as max_month from df_source
group by A, B , C, D
order by A, B , C, D
"""
df_new = spark.sql(sql_1)

Which approach works correctly in pyspark every time? Is there an alternative way, or is the third option the best way to handle this requirement?

Any pointers will be helpful.

Recommended answer

The third approach you propose will work every time. You could also write it like this:

import pyspark.sql.functions as F

(df
    .groupBy('A', 'B', 'C', 'D')
    .agg(F.min('WEEK').alias('min_week'), F.max('WEEK').alias('max_week'),
         F.min('MONTH').alias('min_month'), F.max('MONTH').alias('max_month'))
    .show())

which yields:

+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|min_week|max_week|min_month|max_month|
+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
+---+---+---+---+--------+--------+---------+---------+

It is interesting to understand why the first two approaches produce unpredictable results while the third one always works.

The second approach is unpredictable because Spark is a parallel computation engine. When it aggregates a value, it first aggregates within each partition and then merges the partial results two by two. The order of these merges is not deterministic: it depends, among other things, on the order in which tasks complete, which can change on every attempt, particularly when there is a lot of data.
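
As a quick illustration (a sketch, not from the original answer, assuming the df_source dataframe from the question), first/last used as plain aggregates simply pick whichever row reaches the aggregation first or last within each group, so the values they return can vary with the physical layout of the data:

import pyspark.sql.functions as F

# first()/last() are order-dependent aggregates; after a shuffle the row order
# inside each group is not guaranteed, so on larger datasets the picked values
# can change from run to run. Repartitioning just changes the physical layout.
df_source.repartition(8) \
    .groupBy('A', 'B', 'C', 'D') \
    .agg(F.first('WEEK').alias('first_week'),
         F.last('WEEK').alias('last_week')) \
    .show()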

The first approach is not exactly what you want to do. Window functions do not aggregate the dataframe into a single row; they compute the aggregation and add it to every row. You are also making another mistake: if you order the window, Spark by default uses a frame ranging from the start of the partition to the current row, so last('WEEK') simply returns the current row's week. In fact, to compute the min and the max you do not need to order the window at all. You can just do it like this:

w = Window.partitionBy('A','B', 'C', 'D')
df.select('A', 'B', 'C', 'D',
    F.min('WEEK').over(w).alias('min_week'),
    F.max('WEEK').over(w).alias('max_week'),
    F.min('MONTH').over(w).alias('min_month'),
    F.max('MONTH').over(w).alias('max_month')
).show()

which yields the correct values, but not in the shape you were expecting. At least it shows the difference between window aggregations and regular aggregations:

+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|min_week|max_week|min_month|max_month|
+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
+---+---+---+---+--------+--------+---------+---------+
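
If you do want to keep the ordered-window style from the question, one option (a sketch under the same assumptions, not part of the original answer) is to declare the frame explicitly so that last() sees the whole partition rather than only the rows up to the current one:

from pyspark.sql import Window
import pyspark.sql.functions as F

# An explicit frame covering the whole partition; with orderBy alone, the
# default frame stops at the current row, which is why last() misbehaved.
w2 = Window.partitionBy('A', 'B', 'C', 'D') \
    .orderBy('WEEK', 'MONTH') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df_source.select('A', 'B', 'C', 'D',
    F.first('WEEK').over(w2).alias('min_week'),
    F.last('WEEK').over(w2).alias('max_week'),
    F.first('MONTH').over(w2).alias('min_month'),
    F.last('MONTH').over(w2).alias('max_month')
).distinct().show()

Here .distinct() collapses the identical per-row results back to one row per group, mirroring the expected output; the plain groupBy with min/max shown above remains the simpler and safer choice.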
