Most efficient query to extract a hashtag timeline grouped by date in BigQuery
Problem description
I want to extract a hashtag timeline, grouped by date, from the data column of the notes table. data is a JSON column. Hashtags can appear in either the 't' or the 'd' property, and both must be counted.
Table: notes
----------------------------------------------------------------------
| id | data | created_at |
----------------------------------------------------------------------
| 1 | {"t":"#hash1 title","d":"#hash1 desc"} | 2018-01-01 10:00:00 |
| 2 | {"t":"#hash1 title","d":"#hash1 desc"} | 2018-01-01 11:00:00 |
| 3 | {"t":"title","d":"#hash1 #hash2 desc"} | 2018-01-03 10:00:00 |
As shown below, the required output must list each hashtag with its corresponding timeline in the format DATE:COUNT|DATE:COUNT|DATE:COUNT
Required Output
----------------------------------------------------------
| hashtag | timeline |
----------------------------------------------------------
| #hash1 | 2018-01-01:4|2018-01-03:1 |
| #hash2 | 2018-01-03:1 |
What is the most efficient single query that has all of these features?
- Extract hashtags from the 't' and 'd' properties of data.
- Count hashtags grouped by date.
- Concatenate each hashtag's timeline in the required format.
UPDATE 1: Below is my query. It is inefficient because I have to UNNEST twice, and I have not been able to figure out how to make it more efficient.
WITH
r0 AS (
SELECT JSON_EXTRACT_SCALAR(data, '$.d') as data, created_at
FROM `notes`
UNION ALL
SELECT JSON_EXTRACT_SCALAR(data, '$.t') as data, created_at
FROM `notes`
),
r1 AS (
SELECT created_at, REGEXP_EXTRACT_ALL(data, r"#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])") AS hashtags
FROM r0
),
r2 AS (
SELECT ARRAY_AGG(DATE(created_at)) as created_at_dates, hashtag
FROM r1, UNNEST(hashtags) hashtag
GROUP BY hashtag
),
r3 AS (
SELECT created_at_date, hashtag
FROM r2, UNNEST(created_at_dates) created_at_date
),
r4 AS (
SELECT hashtag, created_at_date, count(created_at_date) as day_val
FROM r3
GROUP BY hashtag, created_at_date
ORDER BY created_at_date
)
SELECT hashtag, STRING_AGG(CONCAT(CAST(created_at_date as STRING),':',CAST(day_val as STRING)), '|') as timeline
FROM r4
GROUP BY hashtag
Recommended answer
The query below is for BigQuery Standard SQL:
#standardSQL
SELECT hashtag,
STRING_AGG(CONCAT(day, ':', cnt), '|' ORDER BY day) AS timeline
FROM (
SELECT hashtag,
CAST(DATE(created_at) AS STRING) day,
CAST(COUNT(1) AS STRING) cnt
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(data, r'"(?:t|d)":(".*?")')) val,
UNNEST(REGEXP_EXTRACT_ALL(val, r'(#.*?)\s')) hashtag
GROUP BY hashtag, day
)
GROUP BY hashtag
If you need to extract from more than just the t and d properties, add them to the (?:t|d) list rather than using multiple UNION ALLs.
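When the list of properties grows, the alternation can be generated instead of typed by hand. The following is a minimal Python sketch of that idea; the helper name value_pattern is hypothetical and not part of the answer, and Python's re module is used here only as a local stand-in for BigQuery's regex engine.

```python
import re

def value_pattern(keys):
    """Hypothetical helper: build the '"(?:k1|k2)":(".*?")' pattern
    used in the query from a list of JSON property names."""
    # re.escape guards against property names containing regex metacharacters.
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(r'"(?:%s)":(".*?")' % alternation)

two_keys = value_pattern(["t", "d"])
three_keys = value_pattern(["t", "d", "x"])

sample = '{"x":"#a note","t":"#b title","d":"plain"}'
found_two = two_keys.findall(sample)
found_three = three_keys.findall(sample)
```

The pattern string produced for ["t", "d"] is the same one that appears in the query, so it could be pasted into REGEXP_EXTRACT_ALL unchanged.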
Running the query above on the sample data in your question gives:
Row hashtag timeline
1 #hash1 2018-01-01:4|2018-01-03:1
2 #hash2 2018-01-03:1
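The two-stage extraction can also be checked locally without running a BigQuery job. The sketch below re-implements the same steps in Python on the three sample rows from the question. It is an illustration only: Python's re module stands in for BigQuery's regex engine (the patterns use syntax common to both), slicing the first ten characters of created_at stands in for DATE(created_at), and the function name hashtag_timelines is made up for this example.

```python
import re
from collections import Counter, defaultdict

# Sample rows from the question: (raw JSON text, created_at).
rows = [
    ('{"t":"#hash1 title","d":"#hash1 desc"}', "2018-01-01 10:00:00"),
    ('{"t":"#hash1 title","d":"#hash1 desc"}', "2018-01-01 11:00:00"),
    ('{"t":"title","d":"#hash1 #hash2 desc"}', "2018-01-03 10:00:00"),
]

# Same patterns as in the query above.
VALUE_RE = re.compile(r'"(?:t|d)":(".*?")')  # value of "t" or "d", quotes included
HASHTAG_RE = re.compile(r'(#.*?)\s')         # a hashtag followed by whitespace

def hashtag_timelines(rows):
    # Inner query: count occurrences per (hashtag, day).
    counts = Counter()
    for data, created_at in rows:
        day = created_at[:10]  # stand-in for DATE(created_at)
        for val in VALUE_RE.findall(data):
            for hashtag in HASHTAG_RE.findall(val):
                counts[(hashtag, day)] += 1
    # Outer query: join "day:cnt" pairs per hashtag, ordered by day.
    timelines = defaultdict(list)
    for (hashtag, day), cnt in sorted(counts.items()):
        timelines[hashtag].append(f"{day}:{cnt}")
    return {h: "|".join(parts) for h, parts in timelines.items()}

timelines = hashtag_timelines(rows)
for hashtag, timeline in timelines.items():
    print(hashtag, timeline)
```

Each loop corresponds to one UNNEST in the query, and the Counter plays the role of GROUP BY hashtag, day.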
Update to address the "deep structure" mentioned in @user2576951's comment
See the updated query below, along with dummy data to test with:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, '{"x":"title","t":"#hash1 title","d":"help #hash1 desc"}' data, TIMESTAMP '2018-01-01 10:00:00' created_at UNION ALL
SELECT 2, '{"t":"#hash1 title","y":"title","d":"#hash1 desc"}', '2018-01-01 11:00:00' UNION ALL
SELECT 3, '{"t":"title","d":"#hash1 #hash2 desc","z":"title"}', '2018-01-03 10:00:00' UNION ALL
SELECT 4, '{"t":"title","d":"description","snippet":{"t":"#hash1","st":"#hash1", "ssd":"#hash3"}}', '2018-02-03 10:00:00'
)
SELECT hashtag,
STRING_AGG(CONCAT(day, ':', cnt), '|' ORDER BY day) AS timeline
FROM (
SELECT
hashtag,
CAST(DATE(created_at) AS STRING) day,
CAST(COUNT(1) AS STRING) cnt
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(data, r'"(?:t|d|st|sd)":"(.*?)"')) val,
UNNEST(REGEXP_EXTRACT_ALL(val, r'(#.*?)(?:$|\s)')) hashtag
GROUP BY hashtag, day
)
GROUP BY hashtag
-- ORDER BY hashtag
with output:
Row hashtag timeline
1 #hash1 2018-01-01:4|2018-01-03:1|2018-02-03:2
2 #hash2 2018-01-03:1
As you can see, hashtags are collected from nested elements, and "ssd" was not matched even though sd is part of it.
I think the above addresses both of your comments / concerns.
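Both points can be reproduced on row 4 of the dummy data with a short local check. This is a sketch in Python, with the re module standing in for BigQuery's regex engine. It shows that the key alternation is anchored by the opening quote, so "ssd" is skipped, and that the (?:$|\s) ending picks up a hashtag that is the last thing in a value, which a pattern requiring trailing whitespace would miss.

```python
import re

# Patterns copied from the updated query.
VALUE_RE = re.compile(r'"(?:t|d|st|sd)":"(.*?)"')
HASHTAG_RE = re.compile(r'(#.*?)(?:$|\s)')

# Row 4 of the dummy data, with the nested "snippet" object.
data = ('{"t":"title","d":"description",'
        '"snippet":{"t":"#hash1","st":"#hash1", "ssd":"#hash3"}}')

# Values of every "t", "d", "st" or "sd" key, at any nesting depth.
values = VALUE_RE.findall(data)

# Hashtags found inside those values; "#hash3" sits under "ssd" and is skipped.
hashtags = [tag for val in values for tag in HASHTAG_RE.findall(val)]

# A pattern that requires trailing whitespace finds nothing in a value
# that ends with the hashtag.
trailing_space_only = re.findall(r'(#.*?)\s', "#hash1")

print(values)
print(hashtags)
print(trailing_space_only)
```

The two hashtags found here account for the 2018-02-03:2 entry in the output above.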