Get ranking of words over date based on frequency in PostgreSQL

Problem description

I have a database that stores twitter data:

        CREATE TABLE tweet (
            ID BIGINT UNIQUE,
            user_ID BIGINT,
            created_at TIMESTAMPTZ,
            tweet TEXT
        );

I'm trying to write a query that goes through the words in tweet across all rows, computes the frequency of each word, and returns the ten most frequent words along with each word's ranking on each date.

Example:

("word1":[1,20,22,23,24,25,26,27,28,29,30,29,28,27,26,25,26,27,28,29,30,29,28,29,28,27,28,29,30,30,...],
"word2":[...])

My current query gets the top ten words, but I am having some trouble getting the rankings of those words for each day.

Current query:

    SELECT word, count(*)
    FROM (
        SELECT regexp_split_to_table(
            regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
            '\s+'
        ) AS word
        FROM tweet
    ) t
    GROUP BY word
    ORDER BY count(*) DESC
    LIMIT 10;

Which returns:

[('vaccine', 286669),
 ('covid', 213857),
 ('yum', 141345),
 ('pfizer', 39532),
 ('people', 28960),
 ('beer', 27117),
 ('say', 24569),
 ('virus', 23682),
 ('want', 21988),
 ('foo', 19823)]

Recommended answer

If you want the top 10 per day, you can do:

select *
from (
    select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
        rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
    from tweet t
    cross join lateral regexp_split_to_table(
        regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
        '\s+'
    ) w(word)
    group by created_day, word
) t
where rn <= 10
order by created_day, rn desc
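
Note that rank() gives tied words the same rank, so a day can return more than ten rows. If the goal is instead the form shown in the question (the ten most frequent words overall, each with a list of its per-day rankings), the same building blocks can be rearranged. The following is only a rough sketch, not a tested solution: it assumes the tweet_clean column used in the queries above exists, the CTE names (words, daily, top_words) and the daily_ranks alias are made up for illustration, and the filter on empty strings is an addition that is not in the original queries.

```sql
-- Sketch only: assumes tweet.tweet_clean exists, as in the queries above.
with words as (
    -- one row per word occurrence, tagged with its day
    select date_trunc('day', t.created_at) as created_day, w.word
    from tweet t
    cross join lateral regexp_split_to_table(
        regexp_replace(t.tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
        '\s+'
    ) as w(word)
    where w.word <> ''  -- added assumption: drop empty tokens left by the split
),
daily as (
    -- frequency and rank of every word within its day
    select created_day, word, count(*) as cnt,
           rank() over (partition by created_day order by count(*) desc) as rn
    from words
    group by created_day, word
),
top_words as (
    -- the ten most frequent words over the whole table
    select word
    from words
    group by word
    order by count(*) desc
    limit 10
)
-- one row per top word, with its daily ranks in date order
select d.word,
       array_agg(d.rn order by d.created_day) as daily_ranks
from daily d
join top_words tw using (word)
group by d.word;
```

Days on which a word does not occur at all produce no entry in its array, so arrays for different words may differ in length. Filling those gaps would need a calendar of days (for example from generate_series) joined in before aggregating.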
