Store aggregate value of a PySpark dataframe column into a variable


Question

I am working with PySpark dataframes here. "test1" is my PySpark dataframe, and event_date is a TimestampType. So when I try to get a distinct count of event_date, the result is an integer variable, but when I try to get the max of the same column, the result is a dataframe. I would like to understand which operations result in a dataframe and which in a variable. I would also like to know how to store the max of the event date as a variable.

Code that results in an integer type:

loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)

Code that results in dataframe type:

from pyspark.sql.functions import max  # needed so max() is the column aggregate, not the builtin

last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)

Edited to add a reproducible example:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("event_date", TimestampType(), True)])

df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),), (datetime(2015, 8, 10, 3, 44, 15),)], schema)

Code that returns a DataFrame:

last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)

Code that returns a variable:

loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt) 

Answer

I'm pretty sure df.select([max('event_date')]) returns a DataFrame because there could be more than one row that has the max value in that column. In your particular use case it may be that no two rows share the same value in that column, but it is easy to imagine a case where more than one row has the same max event_date.

df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.

If you want code to get the max event_date and store it as a variable, try the following:

max_date = df.select([max('event_date')]).distinct().collect()

