如何从Clickhouse中选择带有百分比的数据? [英] How to select data with percentages from Clickhouse?

查看:24
本文介绍了如何从Clickhouse中选择带有百分比的数据?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出下表:

CREATE TABLE main
(
    `job_id` UUID,
    `request_time` DateTime,
    `host_id` UInt8,
    `status_code` LowCardinality(String),
)
ENGINE = MergeTree
ORDER BY request_time
SETTINGS index_granularity = 8192

我正在尝试按主机获取所有状态及其相对百分比。为此,我需要对按主机和状态分组的结果进行计数,并将每个计数的字段除以其主机字段总数。

例如,此查询将在MySQL中工作:

SELECT
    main.host_id,
    main.status_code,
    COUNT() AS status_count,
    COUNT() / sub.host_total * 100 AS percent
FROM
    main
INNER JOIN (
    SELECT host_id, COUNT() AS host_total
    FROM main
    GROUP BY host_id
) AS sub ON (sub.host_id = main.host_id)
GROUP BY
    main.host_id,
    main.status_code

但ClickHouse引发:

DB::EXCEPTION:未知标识符:HOST_TOTAL;有列:HOST_ID、STATUS_CODE、COUNT():处理HOST_ID、STATUS_CODE、COUNT()AS STATUS_COUNT、(COUNT()/HOST_TOTAL)*100作为百分比。(UNKNOWN_IDENTIER)

可能是因为correlated (dependent) subqueries are not supported

有人建议我使用CTE,所以我尝试了一下:

WITH sub AS (
    SELECT host_id, COUNT() AS host_total
    FROM main
    GROUP BY host_id
)
SELECT
    main.host_id,
    main.status_code,
    COUNT() AS status_count,
    COUNT() / (SELECT host_total FROM sub WHERE sub.host_id = main.host_id) * 100 AS percent
FROM
    main
GROUP BY
    main.host_id,
    main.status_code

但仍然没有运气:

数据库::异常:处理查询时缺少列:‘main.host_id’:‘SELECT HOST_TOTAL FROM SUB WHERE host_id=main.host_id’,必填列:‘host_Total’‘host_id’‘main.host_id’‘host_tal`’host_id‘’main.host_id‘:处理时(SELECT HOST_TOTAL FROM SUB WHERE sub.host_id=main.host_id)as_subquery20:WHILE PROCESSING COUNT()/((SELECT HOST_TOTAL FOR SUB WHERE sub.host_id=main.host_id)as_subquery20):处理时(count()/(SELECT host_talfrom fromSub,其中sub.host_id=main.host_id)as_subquery20))*100,以百分比表示。(UNKNOWN_IDENTIER)

推荐答案

CH引发错误。https://github.com/ClickHouse/ClickHouse/issues/4567

HOST_TOTAL应在GROUPBY节中或在聚合函数下

insert into main(request_time, host_id,status_code) values ( now(), 1, 200);
insert into main(request_time, host_id,status_code) values ( now(), 1, 500);
insert into main(request_time, host_id,status_code) values ( now(), 1, 200);
insert into main(request_time, host_id,status_code) values ( now(), 2, 500);
insert into main(request_time, host_id,status_code) values ( now(), 2, 200);
insert into main(request_time, host_id,status_code) values ( now(), 3, 500);

SELECT
    main.host_id,
    main.status_code,
    COUNT() AS status_count,
    round((COUNT() / any(sub.host_total)) * 100, 2) AS percent
FROM main
INNER JOIN
(
    SELECT
        host_id,
        COUNT() AS host_total
    FROM main
    GROUP BY host_id
) AS sub ON sub.host_id = main.host_id
GROUP BY
    main.host_id,
    main.status_code
ORDER BY
    main.host_id ASC,
    main.status_code ASC

┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
│       1 │ 200         │            2 │   66.67 │
│       1 │ 500         │            1 │   33.33 │
│       2 │ 200         │            1 │      50 │
│       2 │ 500         │            1 │      50 │
│       3 │ 500         │            1 │     100 │
└─────────┴─────────────┴──────────────┴─────────┘

但有更好的解决方法:

窗口函数

SELECT
    host_id,
    status_code,
    status_count,
    round((status_count / host_total) * 100, 2) AS percent
FROM
(
    SELECT
        host_id,
        status_code,
        status_count,
        sum(status_count) OVER (PARTITION BY host_id) AS host_total
    FROM
    (
        SELECT
            host_id,
            status_code,
            COUNT() AS status_count
        FROM main
        GROUP BY
            host_id,
            status_code
    )
)
ORDER BY
    host_id ASC,
    status_code ASC

┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
│       1 │ 200         │            2 │   66.67 │
│       1 │ 500         │            1 │   33.33 │
│       2 │ 200         │            1 │      50 │
│       2 │ 500         │            1 │      50 │
│       3 │ 500         │            1 │     100 │
└─────────┴─────────────┴──────────────┴─────────┘

数组

SELECT
    host_id,
    status_code,
    status_count,
    round((status_count / host_total) * 100, 2) AS percent
FROM
(
    SELECT
        host_id,
        sumMap([CAST(status_code, 'String')], [1]) AS ga,
        count() AS host_total
    FROM main
    GROUP BY host_id
)
ARRAY JOIN
    ga.1 AS status_code,
    ga.2 AS status_count

┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
│       1 │ 200         │            2 │   66.67 │
│       1 │ 500         │            1 │   33.33 │
│       2 │ 200         │            1 │      50 │
│       2 │ 500         │            1 │      50 │
│       3 │ 500         │            1 │     100 │
└─────────┴─────────────┴──────────────┴─────────┘

这篇关于如何从Clickhouse中选择带有百分比的数据?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆