Store aggregate value of a PySpark dataframe column into a variable


Question

I am working with PySpark dataframes here. "test1" is my PySpark dataframe, and event_date is a TimestampType. So when I try to get a distinct count of event_date, the result is an integer variable, but when I try to get the max of the same column, the result is a dataframe. I would like to understand which operations result in a dataframe and which in a variable. I would also like to know how to store the max of the event date as a variable.

Code that results in an integer type:

loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)

Code that results in dataframe type:

from pyspark.sql.functions import max  # needed so max() is the column aggregate, not the builtin

last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)

Edited to add a reproducible example:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("event_date", TimestampType(), True)])

df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),), (datetime(2015, 8, 10, 3, 44, 15),)], schema)

Code that returns a DataFrame:

last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)

Code that returns a variable:

loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt) 

Answer

I'm pretty sure df.select([max('event_date')]) returns a DataFrame because there could be more than one row that has the max value in that column. In your particular use case it may be that no two rows share the same value in that column, but it is easy to imagine a case where more than one row has the same max event_date.

df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.

If you want code to get the max event_date and store it as a variable, try the following:

max_date = df.select([max('event_date')]).distinct().collect()

