BigQuery: How to merge HLL sketches over a window function? (Count distinct values over a rolling window)
Question
Example of the relevant table schema:
+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC | fake_id_i24385787 |
+---------------------------+-------------------+
I need to count distinct active users over a rolling 90-day period in a large data set, and I am running into issues due to the size of the data.
At first, I attempted to use a window function, similar to the answer here: https://stackoverflow.com/a/27574474
WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      user_id
    FROM
      `fake-table`)
SELECT
  day,
  SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninty_day_window_apprx
FROM
  daily
GROUP BY
  1
ORDER BY
  1 DESC
However, this computes the number of distinct users per day and then sums those daily counts, so a user who is active on several days within the window is counted once per active day. It is therefore not an accurate measure of distinct users over 90 days.
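The overcounting can be reproduced on a tiny inline data set. The following is a minimal sketch with made-up rows (the `events` CTE and the user IDs `u1` and `u2` are illustrative only, not part of the real table); it uses exact `COUNT(DISTINCT ...)` so the numbers are easy to check by hand.

```sql
#standardSQL
-- Hypothetical sample data: u1 is active on two different days.
WITH events AS (
  SELECT DATE '2017-02-21' AS day, 'u1' AS user_id UNION ALL
  SELECT DATE '2017-02-22', 'u1' UNION ALL
  SELECT DATE '2017-02-22', 'u2'
),
daily AS (
  -- One distinct-user count per day: 1 for 2017-02-21, 2 for 2017-02-22.
  SELECT day, COUNT(DISTINCT user_id) AS daily_users
  FROM events
  GROUP BY day
)
SELECT
  -- Summing the daily counts counts u1 twice: 1 + 2 = 3.
  (SELECT SUM(daily_users) FROM daily) AS sum_of_daily_distincts,
  -- Counting distinct users across the whole period gives 2.
  (SELECT COUNT(DISTINCT user_id) FROM events) AS true_distinct_users
```

The gap between the two columns grows with the number of users who return on several days within the window.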
The next thing I tried was the solution from https://stackoverflow.com/a/47659590: concatenating all the distinct user_ids for each window and then counting the distinct values in the result.
WITH daily AS (
  SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
  FROM `fake-table`
  GROUP BY day
), temp2 AS (
  SELECT
    day,
    STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
  FROM daily
)
SELECT day,
  (SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2
order by 1 desc
However, this quickly ran out of memory on anything large.
Next, I tried using HLL sketches to represent the distinct IDs in a much smaller value, so memory would be less of an issue. I thought my problems were solved, but the query below fails with the error "Function MERGE_PARTIAL is not supported." I tried MERGE as well and got the same error. It only happens when using the window function; creating the sketches for each day's values works fine.
I read through the BigQuery Standard SQL documentation and don't see anything about using HLL_COUNT.MERGE_PARTIAL or HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into one HLL sketch that represents the distinct values across the 90 original sketches?
WITH
  daily AS (
    SELECT
      DATE(activity_date) day,
      HLL_COUNT.INIT(user_id) sketch
    FROM
      `fake-table`
    GROUP BY
      1
    ORDER BY
      1 DESC),
  rolling AS (
    SELECT
      day,
      HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
    FROM daily)
SELECT
  day,
  HLL_COUNT.EXTRACT(rolling_sketch)
FROM
  rolling
ORDER BY
  1
Any ideas why this error happens, or how to adjust the query?
Recommended answer
Combine HLL_COUNT.INIT and HLL_COUNT.MERGE. This solution uses a 90-day cross join with GENERATE_ARRAY(1, 90) instead of OVER.
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
  , HLL_COUNT.MERGE(sketch) unique_90_day_users
  , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
  , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
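Note that, as written, each date_grp row is labeled with date minus i, so it merges the sketches of the 1 to 90 days that follow date_grp. If a trailing window is wanted instead (the current day plus the 89 days before it, as in the question's RANGE BETWEEN 89 PRECEDING AND CURRENT ROW), the same fan-out idea can be adapted. The following is a sketch under that assumption, using the question's placeholder table `fake-table` and its columns activity_date and user_id; the alias window_end is introduced here for illustration.

```sql
#standardSQL
WITH daily AS (
  -- One HLL sketch per calendar day.
  SELECT DATE(activity_date) AS day, HLL_COUNT.INIT(user_id) AS sketch
  FROM `fake-table`
  GROUP BY day
)
SELECT
  -- A day's sketch contributes to every window ending 0 to 89 days after it.
  DATE_ADD(day, INTERVAL i DAY) AS window_end,
  -- Merge all sketches that fall into the window and return the estimate.
  HLL_COUNT.MERGE(sketch) AS unique_90_day_users
FROM daily, UNNEST(GENERATE_ARRAY(0, 89)) AS i
GROUP BY window_end
ORDER BY window_end
```

Because every daily sketch is copied into 90 groups, the output also contains window_end dates up to 89 days past the last day in the data; those rows cover only partial windows and may need to be filtered out.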