How to extract n-gram word sequences from text in Postgres

Question

I am hoping to use Postgres to extract sequences of words from text. For example, the whole-word trigrams for the following sentence

"ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium"

would be:

  • "ed ut perspiciatis"
  • "ut perspiciatis unde"
  • "perspiciatis unde omnis" ...

I have been doing this with R, but I am hoping Postgres can handle it more efficiently.

I have seen a similar question asked here, "n-grams from text in PostgreSQL", but I don't understand how to use pg_trgm to extract word sequences.

Recommended answer

The function below assumes that a word consists of alphanumeric characters. Any run of other characters, including spaces and punctuation, is discarded and treated as a word separator.

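create or replace function word_ngrams(str text, n int)
returns setof text language plpgsql as $$
declare
    i int;
    arr text[];
begin
    -- Split on runs of non-alphanumeric characters to get the word list.
    arr := regexp_split_to_array(str, '[^[:alnum:]]+');
    -- Slide a window of n words over the array, returning one row per window.
    for i in 1 .. cardinality(arr) - n + 1 loop
        return next array_to_string(arr[i : i+n-1], ' ');
    end loop;
end $$;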
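
One edge case worth noting: if the text begins or ends with punctuation, regexp_split_to_array returns an empty string at that end of the array, so the first or last n-gram carries a stray space. Below is a minimal sketch of a variant that drops empty elements, assuming PostgreSQL 9.4 or later (for with ordinality and cardinality). The name word_ngrams_clean is hypothetical and not part of the original answer.

```sql
-- Hypothetical variant of word_ngrams that ignores empty words.
create or replace function word_ngrams_clean(str text, n int)
returns setof text language sql as $$
    with words as (
        -- Split into words, keep their original order, skip empty strings.
        select array_agg(w order by ord) as arr
        from regexp_split_to_table(str, '[^[:alnum:]]+')
             with ordinality as t(w, ord)
        where w <> ''
    )
    -- One row per window of n consecutive words.
    select array_to_string(arr[i : i + n - 1], ' ')
    from words, generate_series(1, cardinality(arr) - n + 1) as g(i)
    order by i
$$;
```

For example, select word_ngrams_clean('Hello, world. Goodbye!', 2) should return only "Hello world" and "world Goodbye", whereas the original function also returns a third row built from the empty string after the final "!".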

Find all three-word phrases:

select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 3)

        word_ngrams         
----------------------------
 ed ut perspiciatis
 ut perspiciatis unde
 perspiciatis unde omnis
 unde omnis iste
 omnis iste natus
 iste natus error
 natus error sit
 error sit voluptatem
 sit voluptatem accusantium
(9 rows)

Find all six-word phrases:

select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 6)

                 word_ngrams                 
---------------------------------------------
 ed ut perspiciatis unde omnis iste
 ut perspiciatis unde omnis iste natus
 perspiciatis unde omnis iste natus error
 unde omnis iste natus error sit
 omnis iste natus error sit voluptatem
 iste natus error sit voluptatem accusantium
(6 rows)
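
To apply the function to a column rather than a literal, it can be called once per row with a lateral join. This is a minimal sketch, assuming a hypothetical table docs(id, body) that is not part of the original question:

```sql
-- Hypothetical table, for illustration only.
create temp table docs (id int, body text);
insert into docs values
    (1, 'ed ut perspiciatis, unde omnis'),
    (2, 'ut perspiciatis unde omnis iste');

-- Expand every row into its trigrams, then count how often each one occurs.
select t.ngram, count(*) as occurrences
from docs
cross join lateral word_ngrams(docs.body, 3) as t(ngram)
group by t.ngram
order by occurrences desc, t.ngram;
```

Because word_ngrams is set-returning, each row of docs expands into one output row per trigram, which can then be grouped and counted like any other column.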
