Removing join operations from a grouping query


Question

I have a table that looks like this:

usr_id  query_ts
12345   2019/05/13 02:06
123444  2019/05/15 04:06
123444  2019/05/16 05:06
12345   2019/05/16 02:06
12345   2019/05/15 02:06

It contains a user ID and the time at which that user ran a query. Each row in the table represents that ID running one query at the given timestamp.

I am trying to produce this:

usr_id  day_1   day_2   …   day_30
12345   31       13           15
123444  23       41           14

I would like to show the number of queries run each day over the last 30 days for each ID; if no query was run on a given day, the count should be 0.

Here is a portion of the query I came up with:

SELECT
t1.usr_id,
case when t1.count_day_1 is null then 0 else t1.count_day_1 end as day_1,
case when t2.count_day_2 is null then 0 else t2.count_day_2 end as day_2
FROM

(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_1,
        COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_1
        FROM db.table
        WHERE
            DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 1
        AND
            from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
                BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
                AND from_unixtime(unix_timestamp())
        GROUP BY usr_id, day_1) t1

LEFT JOIN
(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_2,
        COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_2
        FROM db.table
        WHERE
            DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 2
        AND
            from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
                BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
                AND from_unixtime(unix_timestamp())
        GROUP BY usr_id, day_2) t2
ON (t1.usr_id = t2.usr_id)
ORDER BY t1.usr_id;

This works well: it shows the number of queries run on each of the first 2 days, and it replaces the NULLs with 0s.

The problem is that to get this working for all 30 days I have to use 30 LEFT JOINs, which pulls ~400GB+ of memory on the cluster.

Is there a simpler way?

Recommended answer

Try to do it without joins by counting a CASE expression per day inside a single GROUP BY. Also use the current_date or current_timestamp constants in the WHERE clause instead of unix_timestamp(). Called without arguments, unix_timestamp() is not deterministic: its value is not fixed for the scope of a query execution, which prevents proper optimization of queries. That usage has been deprecated since Hive 2.0 in favour of the CURRENT_TIMESTAMP constant:

select usr_id,
       -- count() ignores the NULL produced when the CASE does not match
       nvl(count(case when from_unixtime(unix_timestamp(query_ts, "yyyy/MM/dd"), "dd") = 1 then 1 end), 0) as day_1,
       nvl(count(case when from_unixtime(unix_timestamp(query_ts, "yyyy/MM/dd"), "dd") = 2 then 1 end), 0) as day_2
...
from db.table
where from_unixtime(unix_timestamp(query_ts, "yyyy/MM/dd"), "yyyy-MM-dd")
      between date_sub(current_date, 30) and current_date
group by usr_id
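
Writing out 30 nearly identical columns by hand is error-prone, so one option is to generate the query text. The following is only a sketch in Python, not part of the original answer: the helper names day_column and build_query are hypothetical, and it only builds the HiveQL string without running it. It follows the answer's day-of-month logic (so day_1 means the 1st of the month) and leaves out nvl, since count() already returns 0 rather than NULL when nothing matches.

```python
# Hypothetical helper: builds the HiveQL text only; it does not run the query.
TS_FORMAT = "yyyy/MM/dd"  # input pattern taken from the answer's query


def day_column(day: int) -> str:
    """Return one conditional-count column for the given day of month."""
    day_expr = f'from_unixtime(unix_timestamp(query_ts, "{TS_FORMAT}"), "dd")'
    # count() skips NULLs, so only rows matching the day are counted.
    return f"count(case when {day_expr} = {day} then 1 end) as day_{day}"


def build_query(table: str = "db.table", days: int = 30) -> str:
    """Assemble the join-free query with one column per day (1..days)."""
    columns = ",\n       ".join(day_column(d) for d in range(1, days + 1))
    date_expr = (
        f'from_unixtime(unix_timestamp(query_ts, "{TS_FORMAT}"), "yyyy-MM-dd")'
    )
    return (
        f"select usr_id,\n       {columns}\n"
        f"from {table}\n"
        f"where {date_expr}\n"
        f"      between date_sub(current_date, {days}) and current_date\n"
        f"group by usr_id"
    )


if __name__ == "__main__":
    print(build_query())
```

The printed text can be pasted into the Hive client. Because every column is computed in the same pass over db.table, the table is scanned once instead of once per LEFT JOIN.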
