pyspark substring and aggregation


Question


I am new to Spark and I've got a csv file with such data:

date,             accidents, injured
2015/20/03 18:00, 15,        5
2015/20/03 18:30, 25,        4
2015/20/03 21:10, 14,        7
2015/20/02 21:00, 15,        6


I would like to aggregate this data by the specific hour in which it happened. My idea is to substring the date to 'year/month/day hh' with no minutes so I can use it as a key. I want to get the average number of accidents and injured per hour. Maybe there is a different, smarter way with pyspark?

Thanks, everyone!

Answer


Well, it depends on what you're going to do afterwards, I guess.


The simplest way would be to do as you suggest: substring the date string and then aggregate:

data = [('2015/20/03 18:00', 15, 5),
        ('2015/20/03 18:30', 25, 4),
        ('2015/20/03 21:10', 14, 7),
        ('2015/20/02 21:00', 15, 6)]
df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])

df.withColumn('date_hr', df['date'].substr(1, 13)) \
  .groupby('date_hr') \
  .agg({'accidents': 'avg', 'injured': 'avg'}) \
  .show()
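As a side note, `Column.substr(1, 13)` is 1-indexed and keeps 13 characters, which for this date format is 'year/day/month hh' with the ':mm' dropped. A plain-Python sketch of the keys it produces:

```python
# substr(1, 13) on a Spark Column is equivalent to s[:13] in Python.
dates = ['2015/20/03 18:00', '2015/20/03 18:30', '2015/20/02 21:00']
keys = [d[:13] for d in dates]
print(keys)  # ['2015/20/03 18', '2015/20/03 18', '2015/20/02 21']
```

The first two rows share the key '2015/20/03 18', so they fall into the same group when averaging.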


If you, however, want to do some more computation later on, you can parse the data to a TimestampType() and then extract the date and hour from that.

import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime

parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())

df.withColumn('date_parsed', parseString(col('date'))) \
  .withColumn('date_only', getDate(col('date_parsed'))) \
  .withColumn('hour', getHour(col('date_parsed'))) \
  .groupby('date_only', 'hour') \
  .agg({'accidents': 'avg', 'injured': 'avg'}) \
  .show()
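Note the format string: with dates written like '2015/20/03', the order is year/day/month, so '%Y/%d/%m %H:%M' is the matching pattern. A quick sanity check of the parse in plain Python (no Spark needed):

```python
from datetime import datetime

# '%Y/%d/%m %H:%M' reads '2015/20/03 18:00' as 20 March 2015, 18:00
ts = datetime.strptime('2015/20/03 18:00', '%Y/%d/%m %H:%M')
print(ts.date(), ts.hour)  # 2015-03-20 18
```

This is the same `date()`/`hour` pair the two extraction UDFs above return for each row.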

