Spark: need confirmation on approach to capturing the first and last date on a dataset


Problem Description

I have a dataframe:

A, B, C, D, 201701, 2020001
A, B, C, D, 201801, 2020002
A, B, C, D, 201901, 2020003

Expected output:

col_A, col_B, col_C, col_D, min_week, max_week, min_month, max_month
A,     B,     C,     D,     201701,   201901,   2020001,   2020003

我在pyspark中尝试过的事情-

What I tried in pyspark-

from pyspark.sql import Window
import pyspark.sql.functions as psf

w1 = Window.partitionBy('A','B', 'C', 'D')\
.orderBy('WEEK','MONTH')
df_new = df_source\
.withColumn("min_week", psf.first("WEEK").over(w1))\
.withColumn("max_week", psf.last("WEEK").over(w1))\
.withColumn("min_month", psf.first("MONTH").over(w1))\
.withColumn("max_month", psf.last("MONTH").over(w1))

I also tried this:

sql_1 = """
select A, B , C, D, first(WEEK) as min_week, 
last(WEEK) as max_week , first(MONTH) as min_month, 
last(MONTH) as max_month from df_source
group by A, B , C, D
order by A, B , C, D
"""
df_new = spark.sql(sql_1)

Using the first and second approaches I got inconsistent results. Will the approach below fix the issue encountered above?

sql_1 = """
select A, B , C, D, min(WEEK) as min_week, 
max(WEEK) as max_week , min(MONTH) as min_month, 
max(MONTH) as max_month from df_source
group by A, B , C, D
order by A, B , C, D
"""
df_new = spark.sql(sql_1)

Which approach works reliably in PySpark every time? Is there an alternative way, or is the third option the best way to handle this requirement?

Any pointers would be helpful.

Recommended Answer

The third approach you propose will work every time. You could also write it like this:

import pyspark.sql.functions as F

(df
    .groupBy('A', 'B', 'C', 'D')
    .agg(F.min('WEEK').alias('min_week'), F.max('WEEK').alias('max_week'),
         F.min('MONTH').alias('min_month'), F.max('MONTH').alias('max_month'))
    .show())

which yields:

+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|min_week|max_week|min_month|max_month|
+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
+---+---+---+---+--------+--------+---------+---------+

It is interesting to understand why the first two approaches produce unpredictable results while the third always works.

The second approach is unpredictable because Spark is a parallel computation engine. When it aggregates a value, it first aggregates within each partition and then combines the partial results pairwise, and the order of these combinations is not deterministic. Among other things, it depends on the order in which tasks complete, which can change from run to run, especially when there is a lot of data. Since first() and last() are order-dependent, their results change with it.
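As a minimal sketch of this effect (the DataFrame construction below is my own, only the column names come from the question), you can make the sensitivity to partition layout visible by repartitioning before aggregating:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question (column names assumed).
data = [("A", "B", "C", "D", 201701, 2020001),
        ("A", "B", "C", "D", 201801, 2020002),
        ("A", "B", "C", "D", 201901, 2020003)]
df_source = spark.createDataFrame(data, ["A", "B", "C", "D", "WEEK", "MONTH"])

# Changing the partition layout changes which rows first()/last() happen to see,
# so the aggregated values can differ between layouts and between runs.
for n in (1, 2, 3):
    (df_source.repartition(n)
        .groupBy("A", "B", "C", "D")
        .agg(F.first("WEEK").alias("first_week"),
             F.last("WEEK").alias("last_week"))
        .show())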

The first approach is not exactly what you want to do. Window functions do not collapse the dataframe into a single row; they compute the aggregation and attach it to every row. You are also making a few mistakes: when a window is ordered, Spark by default uses a frame that runs from the start of the partition to the current row, so last('WEEK') simply returns the current row's week rather than the maximum. In fact, to compute the min and the max you do not need to order the dataframe at all. You can do it like this:

w = Window.partitionBy('A','B', 'C', 'D')
df.select('A', 'B', 'C', 'D',
    F.min('WEEK').over(w).alias('min_week'),
    F.max('WEEK').over(w).alias('max_week'),
    F.min('MONTH').over(w).alias('min_month'),
    F.max('MONTH').over(w).alias('max_month')
).show()

which yields the correct values, though not in the single-row shape you were expecting. At least it shows the difference between window aggregations and regular aggregations:

+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|min_week|max_week|min_month|max_month|
+---+---+---+---+--------+--------+---------+---------+
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
|  A|  B|  C|  D|  201701|  201901|  2020001|  2020003|
+---+---+---+---+--------+--------+---------+---------+
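For completeness, the first()/last() window approach from the question can also be made deterministic by widening the default frame explicitly. The sketch below is my own variant, not part of the original answer, and it assumes WEEK and MONTH increase together (as they do in the sample data), since both are read through the same ordered window:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Widen the frame so first()/last() see the whole partition, not just
# the rows up to the current one (the default once orderBy is used).
w1 = (Window.partitionBy('A', 'B', 'C', 'D')
            .orderBy('WEEK', 'MONTH')
            .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df_new = (df_source
    .select('A', 'B', 'C', 'D',
            F.first('WEEK').over(w1).alias('min_week'),
            F.last('WEEK').over(w1).alias('max_week'),
            F.first('MONTH').over(w1).alias('min_month'),
            F.last('MONTH').over(w1).alias('max_month'))
    .distinct())  # collapse the per-row window results to one row per group

Even so, the plain min/max aggregation shown above remains the simpler and safer choice.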
