使用PostgreSQL找到流行的字符串 [英] Locate popular strings with PostgreSQL

查看:119
本文介绍了使用PostgreSQL找到流行的字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在PostgreSQL表中有一堆文本行,并且我正在尝试查找常见的字符串。

I have a bunch of text rows in a PostgreSQL table and I am trying to find common strings.

例如,假设我有一个基本表,例如:

For example, let's say I have a basic table like:

CREATE TABLE a (id serial, value text);
INSERT INTO a (value) VALUES
    ('I go to the movie theater'), 
    ('New movie theater releases'), 
    ('Coming out this week at your local movie theater'),
    ('New exposition about learning disabilities at the children museum'),
    ('The genius found in learning disabilities')
;

我正在尝试查找电影院学习障碍在所有行中(目标是显示趋势字符串之类的列表,例如Twitter趋势之王)

I am trying to locate popular strings like movie theater and learning disabilities across all the rows (the goal is to show a list of "trending" strings king of like Twitter "Trends")

我使用全文搜索,并且尝试将 ts_stat ts_headline 结合使用,但是结果令人失望。

I use full text search and I have tried to use ts_stat combined with ts_headline but the results are quite disappointing.

有什么想法吗?谢谢!

推荐答案

没有现成的Posgres文本搜索功能可以找到最受欢迎的短语。对于两个单词的短语,您可以使用 ts_stat()查找最受欢迎的单词,消除质点,介词等,然后交叉连接这些单词以找到最受欢迎的单词对。

There is no ready-to-use Posgres text search feature to find most popular phrases. For two-words phrases you can use ts_stat() to find most popular words, eliminate particles, prepositions etc, and cross join these words to find most popular pairs.

对于实际数据,您需要更改标记为->的值。参数。在较大的数据集上,查询可能会非常昂贵。

For an actual data you would want to change values marked as --> parameter. The query may be quite expensive on a larger dataset.

with popular_words as (
    select word
    from ts_stat('select value::tsvector from a')
    where nentry > 1                                --> parameter
    and not word in ('to', 'the', 'at', 'in', 'a')  --> parameter
)
select concat_ws(' ', a1.word, a2.word) phrase, count(*) 
from popular_words as a1
cross join popular_words as a2
cross join a
where value ilike format('%%%s %s%%', a1.word, a2.word)
group by 1
having count(*) > 1                                 --> parameter
order by 2 desc;


        phrase         | count 
-----------------------+-------
 movie theater         |     3
 learning disabilities |     2
(2 rows)

这篇关于使用PostgreSQL找到流行的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆