How to find longest sequence of consecutive dates?


Problem description

I have a database of visit times stored as Unix timestamps, like this:

ID, time
1, 1493596800
1, 1493596900
1, 1493432800
2, 1493596800
2, 1493596850
2, 1493432800

I use Spark SQL and I need to find the longest sequence of consecutive dates for each ID, like:

ID, longest_seq (days)
1, 2
2, 5
3, 1

I tried to adapt this answer, Detect consecutive dates ranges using SQL, to my case, but I didn't manage to get what I expected.

SELECT ID, MIN(d), MAX(d)
FROM (
  SELECT ID,
         cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date) AS d,
         ROW_NUMBER() OVER (
           PARTITION BY ID
           ORDER BY cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)) rn
  FROM purchase
  WHERE ID IS NOT NULL
  GROUP BY ID, cast(from_utc_timestamp(cast(time as timestamp), 'CEST') as date)
)
GROUP BY ID, rn
ORDER BY ID

If someone has a clue on how to fix this query, or what's wrong with it, I would appreciate the help. Thanks.

More explicit input/output:

ID, time
1, 1
1, 2
1, 3
2, 1
2, 3
2, 4
2, 5
2, 10
2, 11
3, 1
3, 4
3, 9
3, 11

The result would be:

ID, MaxSeq (in days)
1, 3
2, 3
3, 1

All the visits are timestamps, but I need consecutive days, so each day's visits are counted only once per day.
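
The first step I need is therefore to collapse the timestamps to one distinct day per ID. A minimal sketch of that in Spark's Scala API (assuming a SparkSession named spark and the purchase table from the query above; the exact time-zone handling, 'CEST' in my attempt, is left open here) could be:

import org.apache.spark.sql.functions.{col, from_unixtime, to_date}

// Sketch: reduce the raw Unix-timestamp visits to one row per (ID, day).
// from_unixtime uses the Spark session time zone, so the 'CEST' detail
// from the query above is not reproduced exactly here.
val visitsByDay = spark.table("purchase").
  where(col("ID").isNotNull).
  select(col("ID"), to_date(from_unixtime(col("time"))) as "d").
  distinct()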

Recommended answer

That's the case for my beloved window aggregate functions!

I think the following example could help you out (at least to get started).

The following is the dataset I use. I translated your time (in longs) to numeric time to denote the day (and avoid messing around with timestamps in Spark SQL which could make the solution harder to comprehend...possibly).

In the visit dataset below, the time column represents the day number, so values that increase one by one represent consecutive days.

scala> visits.show
+---+----+
| ID|time|
+---+----+
|  1|   1|
|  1|   1|
|  1|   2|
|  1|   3|
|  1|   3|
|  1|   3|
|  2|   1|
|  3|   1|
|  3|   2|
|  3|   2|
+---+----+
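
For reference, the visits dataset above can be recreated in the spark-shell along these lines (a minimal sketch; spark-shell provides the toDF implicits automatically):

// Recreate the sample dataset shown above.
val visits = Seq(
  (1, 1), (1, 1), (1, 2), (1, 3), (1, 3), (1, 3),
  (2, 1),
  (3, 1), (3, 2), (3, 2)
).toDF("ID", "time")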

Let's define the window specification to group id rows together.

import org.apache.spark.sql.expressions.Window
val idsSortedByTime = Window.
  partitionBy("id").
  orderBy("time")

With that you rank the rows and count rows with the same rank.

// rank and count come from org.apache.spark.sql.functions (imported automatically in spark-shell)
val answer = visits.
  select($"id", $"time", rank over idsSortedByTime as "rank").
  groupBy("id", "time", "rank").
  agg(count("*") as "count")
scala> answer.show
+---+----+----+-----+
| id|time|rank|count|
+---+----+----+-----+
|  1|   1|   1|    2|
|  1|   2|   3|    1|
|  1|   3|   4|    3|
|  3|   1|   1|    1|
|  3|   2|   2|    2|
|  2|   1|   1|    1|
+---+----+----+-----+

That appears (very close?) to a solution. You seem done!
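
One possible way to finish from here (a sketch, not spelled out in the original answer) is the classic gaps-and-islands trick: within a partition, time minus dense_rank(time) stays constant across a run of consecutive days, so it can serve as a group key for each run.

import org.apache.spark.sql.functions.{col, countDistinct, dense_rank, max}

val longest = visits.
  // time - dense_rank(time) is constant within a run of consecutive days;
  // dense_rank (rather than row_number) keeps duplicate days in the same run
  withColumn("grp", col("time") - dense_rank().over(idsSortedByTime)).
  groupBy("id", "grp").
  agg(countDistinct("time") as "seq_len").   // several visits on the same day count once
  groupBy("id").
  agg(max("seq_len") as "longest_seq")

On the sample visits above this should give 3 for id 1, 1 for id 2 and 2 for id 3, and on the question's second example 3, 3 and 1, matching the expected MaxSeq values.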

