First row for each group

Problem description

I have a table containing one row for each visit to an endpoint. The table looks like this:

user_id STRING
endpoint_id STRING
created_at TIMESTAMP

Sample data:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-02 01:01:01 UTC
user-1, endpoint-1, 2016-01-03 01:01:01 UTC

How can I get the first visit row for each user and resource?

What is the best way to construct such a query?

Expected result:

user-1, endpoint-1, 2016-01-01 01:01:01 UTC
user-2, endpoint-1, 2016-01-01 01:01:01 UTC
user-1, endpoint-2, 2016-01-02 01:01:01 UTC


Here is what I came up with, but this query will not work for a large amount of data. I use a window function to group together repeating user/resource rows:

SELECT
    user_id,
    endpoint_id,
    created_at
FROM (
    SELECT 
        user_id, 
        endpoint_id, 
        created_at,
        FIRST_VALUE(created_at) OVER (PARTITION BY user_id, endpoint_id ORDER BY created_at DESC) AS first_created_at
    FROM 
        [visits]
    )
WHERE
    created_at = first_created_at

Recommended answer

"How to get the first visit row for each user and resource?"

In the query presented in the question, remove DESC from ORDER BY created_at DESC. Otherwise it returns the last visit, not the first.
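The effect of the sort direction can be checked on the sample data from the question. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption, and it needs SQLite 3.25 or newer for window functions), stores timestamps as text, and the helper name run is made up for this example.

```python
import sqlite3

# SQLite stands in for BigQuery here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-2", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-1", "endpoint-2", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-03 01:01:01"),
    ],
)

# Same shape as the query in the question; only the sort direction varies.
QUERY = """
SELECT user_id, endpoint_id, created_at
FROM (
    SELECT user_id, endpoint_id, created_at,
           FIRST_VALUE(created_at) OVER (
               PARTITION BY user_id, endpoint_id ORDER BY created_at {direction}
           ) AS first_created_at
    FROM visits
)
WHERE created_at = first_created_at
ORDER BY user_id, endpoint_id
"""


def run(direction):
    """Hypothetical helper: run the query with ASC or DESC ordering."""
    return conn.execute(QUERY.format(direction=direction)).fetchall()


ascending = run("ASC")    # earliest visit per (user_id, endpoint_id)
descending = run("DESC")  # latest visit per (user_id, endpoint_id)

for row in ascending:
    print("ASC ", row)
for row in descending:
    print("DESC", row)
```

With ascending order, user-1 / endpoint-1 comes back with the 2016-01-01 row. With DESC it comes back with the 2016-01-03 row, which is the last visit.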

"What is the best way to construct such a query?"

Another option is to use ROW_NUMBER(), as below:

SELECT
  user_id,
  endpoint_id,
  created_at
FROM (
  SELECT 
      user_id, 
      endpoint_id, 
      created_at,
      ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
  FROM [visits]
)
WHERE first_created = 1

"... but this query will not work for a large amount of data"

It really depends. Resources Exceeded can happen if a user_id, endpoint_id partition is big enough, because ORDER BY requires all rows of a partition to be on the same node.
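One way to gauge the risk is to count rows per partition and look at the largest ones. This is a rough sketch, not part of the original answer: it runs against SQLite through Python's sqlite3 module as a stand-in for BigQuery (an assumption), and the column alias rows_in_partition and the LIMIT of 10 are arbitrary choices.

```python
import sqlite3

# SQLite stands in for BigQuery here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-2", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-1", "endpoint-2", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-03 01:01:01"),
    ],
)

# Rows per (user_id, endpoint_id) partition, largest first.
largest = conn.execute(
    """
    SELECT user_id, endpoint_id, COUNT(*) AS rows_in_partition
    FROM visits
    GROUP BY user_id, endpoint_id
    ORDER BY rows_in_partition DESC, user_id, endpoint_id
    LIMIT 10
    """
).fetchall()

for row in largest:
    print(row)
```

On the sample data the largest partition is user-1 / endpoint-1 with three rows. On a real table, a few partitions holding a very large share of the rows would be the ones likely to cause trouble.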

If that is your case, you can use the trick below.

Step 1 - Use JOIN

SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at 
FROM [visits] AS tab1
INNER JOIN (
  SELECT user_id, endpoint_id, MIN(created_at) AS min_time 
  FROM [visits] 
  GROUP BY user_id, endpoint_id
) AS tab2
ON  tab1.user_id = tab2.user_id 
AND tab1.endpoint_id = tab2.endpoint_id 
AND tab1.created_at = tab2.min_time  

Step 2 - There is still something else to take care of here: duplicate entries for the same user / resource. If several rows share the minimum created_at, the JOIN returns all of them, so you still need to extract only one row for each partition. See the final query below.

SELECT user_id, endpoint_id, created_at
FROM (
  SELECT user_id, endpoint_id, created_at, 
    ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn 
  FROM (
    SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at 
    FROM [visits]  AS tab1
    INNER JOIN (
      SELECT user_id, endpoint_id, MIN(created_at) AS min_time 
      FROM [visits]  
      GROUP BY user_id, endpoint_id
    ) AS tab2
    ON  tab1.user_id = tab2.user_id 
    AND tab1.endpoint_id = tab2.endpoint_id 
    AND tab1.created_at = tab2.min_time
  )
)
WHERE rn = 1  
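The difference between the two steps shows up once the data contains a tie. The sketch below is an illustration under assumptions: it uses Python's sqlite3 module instead of BigQuery (SQLite 3.25 or newer for ROW_NUMBER), and the duplicated user-1 row is invented for the example and is not part of the question's sample data.

```python
import sqlite3

# SQLite stands in for BigQuery here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, endpoint_id TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),
        ("user-1", "endpoint-1", "2016-01-01 01:01:01"),  # hypothetical duplicate
        ("user-1", "endpoint-1", "2016-01-02 01:01:01"),
        ("user-2", "endpoint-1", "2016-01-01 01:01:01"),
    ],
)

# Step 1: join each row against the minimum created_at of its group.
JOIN_ONLY = """
SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id,
       tab1.created_at AS created_at
FROM visits AS tab1
INNER JOIN (
    SELECT user_id, endpoint_id, MIN(created_at) AS min_time
    FROM visits
    GROUP BY user_id, endpoint_id
) AS tab2
ON  tab1.user_id = tab2.user_id
AND tab1.endpoint_id = tab2.endpoint_id
AND tab1.created_at = tab2.min_time
"""

# Step 2: keep one row per group. The window has no ORDER BY, so it only
# numbers rows that already passed the join filter.
DEDUPED = f"""
SELECT user_id, endpoint_id, created_at
FROM (
    SELECT user_id, endpoint_id, created_at,
           ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
    FROM ({JOIN_ONLY}) AS joined
) AS numbered
WHERE rn = 1
ORDER BY user_id, endpoint_id
"""

join_rows = conn.execute(JOIN_ONLY).fetchall()
deduped_rows = conn.execute(DEDUPED).fetchall()

print(len(join_rows), "rows from the join alone")
print(len(deduped_rows), "rows after ROW_NUMBER filtering")
```

The join alone returns both copies of the tied user-1 row, while the wrapped query returns one row per user / endpoint pair. The window in Step 2 only sees rows that survived the join, so its partitions are much smaller than in the original ROW_NUMBER() query.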

And of course there is the obvious and simplest case: if those three fields are the only fields in the [visits] table, a plain GROUP BY is enough.

SELECT user_id, endpoint_id, MIN(created_at) AS created_at 
FROM [visits]
GROUP BY user_id, endpoint_id
