Count the number of sessions if the beginning and end of each session is known


Problem Description


I have a Hive table with two date-time columns: the start and the end of a "session". The following is a sample of such a table:

+----------------------+----------------------+--+
| start_time           | end_time             |
+----------------------+----------------------+--+
| 2017-01-01 00:24:52  | 2017-01-01 00:25:20  |
| 2017-01-01 00:31:11  | 2017-01-01 10:31:15  |
| 2017-01-01 10:31:15  | 2017-01-01 20:40:53  |
| 2017-01-01 20:40:53  | 2017-01-01 20:40:53  |
| 2017-01-01 10:31:15  | 2017-01-01 10:31:15  |
| 2017-01-01 07:09:34  | 2017-01-01 07:29:00  |
| 2017-01-01 11:36:41  | 2017-01-01 15:32:00  |
| 2017-01-01 07:29:00  | 2017-01-01 07:34:30  |
| 2017-01-01 11:06:30  | 2017-01-01 11:36:41  |
| 2017-01-01 07:45:00  | 2017-01-01 07:50:00  |
+----------------------+----------------------+--+


There are a lot of sessions. I need to get a dataset that presents the number of sessions on a half-hour time grid over some interval, as follows:

+----------------------+--------------+--+
| time                 | sessions_num |
+----------------------+--------------+--+
| 2018-07-04 00:30:00  |          85  |
| 2018-07-04 01:00:00  |          86  |
| 2018-07-04 01:30:00  |          84  |
| 2018-07-04 02:00:00  |          85  |
| 2018-07-04 02:30:00  |          84  |
| 2018-07-04 03:00:00  |          84  |
| 2018-07-04 03:30:00  |          84  |
| 2018-07-04 04:00:00  |          84  |
| 2018-07-04 04:30:00  |          84  |
| 2018-07-04 05:00:00  |          84  |
| 2018-07-04 05:30:00  |          84  |
| 2018-07-04 06:00:00  |          84  |
| 2018-07-04 06:30:00  |          85  |
| 2018-07-04 07:00:00  |          85  |
| 2018-07-04 07:30:00  |          85  |
| 2018-07-04 08:00:00  |          85  |
| 2018-07-04 08:30:00  |          85  |
| 2018-07-04 09:00:00  |          83  |
| 2018-07-04 09:30:00  |          82  |
| 2018-07-04 10:00:00  |          82  |
| 2018-07-04 10:30:00  |          83  |
| 2018-07-04 11:00:00  |          82  |
| 2018-07-04 11:30:00  |          82  |
| 2018-07-04 12:00:00  |          83  |
+----------------------+--------------+--+


What is the Apache Hive or Apache Spark way, or maybe some other way, to make the last table from the first one?

Answer


You can do that with the DataFrame window function, but it will require some preprocessing of your data. PySpark example:

#creating example dataframe
from pyspark.sql.functions import to_timestamp
l = [('2017-01-01 00:24:52', '2017-01-01 00:25:20')
    ,('2017-01-01 00:31:11', '2017-01-01 10:31:15')
    ,('2017-01-01 10:31:15', '2017-01-01 20:40:53')
    ,('2017-01-01 20:40:53', '2017-01-01 20:40:53')
    ,('2017-01-01 10:31:15', '2017-01-01 10:31:15')
    ,('2017-01-01 07:09:34', '2017-01-01 07:29:00')
    ,('2017-01-01 11:36:41', '2017-01-01 15:32:00')
    ,('2017-01-01 07:29:00', '2017-01-01 07:34:30')
    ,('2017-01-01 11:06:30', '2017-01-01 11:36:41')
    ,('2017-01-01 07:45:00', '2017-01-01 07:50:00')
    ]
df = spark.createDataFrame(l, ['begin', 'end'])
df = df.select(to_timestamp(df.begin).alias('begin'), to_timestamp(df.end).alias('end'))
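
Note that the snippet above assumes an existing SparkSession bound to the name spark, as in the PySpark shell. If you run it as a standalone script, a minimal sketch like the following would create one (the app name is just an illustrative choice):

# minimal sketch: create a SparkSession when not running inside the PySpark shell
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("session-count-example")   # hypothetical app name
         .getOrCreate())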


Now we create a new column which contains a list of timestamps, one for every 30 minutes of a session. Just imagine a client raising an event every 30 minutes after the session beginning, plus one more event if the session end falls into a different half-hour window than the last of those events (for example, begin: 2017-01-01 00:24:52, end: 2017-01-01 00:25:20 leads to one event, while begin: 2017-01-01 07:29:00, end: 2017-01-01 07:34:30 raises two events):

from pyspark.sql.functions import window
from pyspark.sql.types import ArrayType, TimestampType
from pyspark.sql.functions import udf, array, explode
from datetime import timedelta

def generateRows(arr):
    # arr[0] is the session start, arr[1] is the session end
    li = [arr[0]]

    # emit one timestamp every 30 minutes between begin and end
    while (li[-1] + timedelta(minutes=30)) < arr[1]:
        li.append(li[-1] + timedelta(minutes=30))

    # check whether the last generated timestamp and the session end fall
    # into different half-hour windows; if so, also emit the end itself
    rounded = li[-1] - timedelta(minutes=li[-1].minute % 30,
                                 seconds=li[-1].second,
                                 microseconds=li[-1].microsecond)

    if (rounded + timedelta(minutes=30)) < arr[1]:
        li.append(arr[1])

    return li

generateRows_udf = udf(lambda arr: generateRows(arr), ArrayType(TimestampType()))

dftoExplode = df.withColumn('toExplode', generateRows_udf(array(df.begin, df.end)))
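
As a quick sanity check of the helper (plain Python, no Spark needed), the two sessions mentioned above behave as described:

# local check of generateRows with plain datetime values
from datetime import datetime

short_session = [datetime(2017, 1, 1, 0, 24, 52), datetime(2017, 1, 1, 0, 25, 20)]
split_session = [datetime(2017, 1, 1, 7, 29, 0), datetime(2017, 1, 1, 7, 34, 30)]

print(generateRows(short_session))  # one event: just the session start
print(generateRows(split_session))  # two events: the start and the session end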


Now we can 'explode' the toExplode column to create one row for every event:

df_exploded = dftoExplode.withColumn('EventSessionOpen', explode('toExplode'))
df_exploded = df_exploded.drop(df_exploded.toExplode)
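
If you want to see what the preprocessing produced before aggregating, inspecting the exploded frame (columns begin, end and EventSessionOpen) is a quick way to check it:

# optional: inspect the intermediate, exploded result
df_exploded.show(truncate=False)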


and finally we can apply the dataframe window function to get the desired result:

result = df_exploded.groupBy(window(df_exploded.EventSessionOpen, "30 minutes")).count().orderBy("window")
result.show(truncate=False)
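
The result keeps each bucket as a window struct column next to count; if you want the flat time / sessions_num layout shown in the question, a small projection on top of result should do it. This is a sketch that assumes you want to label each bucket with its window end, as the sample output suggests:

# sketch: flatten the window struct into the time / sessions_num layout
from pyspark.sql.functions import col

flat = (result
        .select(col("window.end").alias("time"),
                col("count").alias("sessions_num"))
        .orderBy("time"))
flat.show(truncate=False)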
