How to get max(date) from a given set of data grouped by some fields using PySpark?

Question

I have the data in a dataframe as below:

datetime             | userId | memberId | value
2016-04-06 16:36:... | 1234   | 111      | 1
2016-04-06 17:35:... | 1234   | 222      | 5
2016-04-06 17:50:... | 1234   | 111      | 8
2016-04-06 18:36:... | 1234   | 222      | 9
2016-04-05 16:36:... | 4567   | 111      | 1
2016-04-06 17:35:... | 4567   | 222      | 5
2016-04-06 18:50:... | 4567   | 111      | 8
2016-04-06 19:36:... | 4567   | 222      | 9

I need to find the max(datetime) grouped by userId, memberId. When I tried the following:

df2 = df.groupBy('userId','memberId').max('datetime')

I got the following error:

org.apache.spark.sql.AnalysisException: "datetime" is not a numeric
column. Aggregation function can only be applied on a numeric column.;

The output I want is as follows:

userId | memberId | datetime
1234   |  111     | 2016-04-06 17:50:...
1234   |  222     | 2016-04-06 18:36:...
4567   |  111     | 2016-04-06 18:50:...
4567   |  222     | 2016-04-06 19:36:...

Can someone please help me get the max date from the given data using PySpark dataframes?

Answer

For non-numeric but orderable types you can use agg with max directly:

from pyspark.sql.functions import col, max as max_  # alias max to avoid shadowing Python's builtin

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

# Cast the string column to a timestamp, then aggregate with max,
# which works on any orderable type, not just numeric columns.
(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
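If you also need the other columns from the row that holds the latest datetime (not just the max value itself), one common approach is a window function with row_number. A minimal sketch, assuming Spark 2.x or later with a SparkSession; the names w and latest are illustrative:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:50", 1234, 111, 8),
], ["datetime", "userId", "memberId", "value"]) \
    .withColumn("datetime", col("datetime").cast("timestamp"))

# Rank rows within each (userId, memberId) group by datetime, newest first,
# then keep only the top-ranked row per group.
w = Window.partitionBy("userId", "memberId").orderBy(col("datetime").desc())
latest = (df.withColumn("rn", row_number().over(w))
            .filter(col("rn") == 1)
            .drop("rn"))
latest.show()

Unlike groupBy().agg(), this keeps the whole row (including the value column) for each group's latest timestamp.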
