SQL select rows containing substring in text field

Views: 343 Published: 2020/5/30 1:45:08 sql postgresql pattern-matching

Question

I have a CLIENTS_WORDS table with the columns ID, CLIENT_ID and WORD in a PostgreSQL database:

ID|CLIENT_ID|WORD
1 |1242     |word1
2 |1242     |WordX.foo
3 |1372     |nextword
4 |1999     |word1

The table may hold about 100k-500k rows.
I have query strings like these:

'Some people tell word1 to someone'
'Another stringWordX.foo too possible'

I want to select all rows from the table whose WORD column text is contained in the query string.
Currently I use this select:

select * from CLIENTS_WORDS
where strpos('Some people tell word1 to someone', WORD) > 0

My question: what is the best-performing / fastest way to retrieve the matching rows?

Recommended answer

You get better performance with unnest() and JOIN. Like this:

SELECT DISTINCT c.client_id
FROM   unnest(string_to_array('Some people tell word1 ...', ' ')) AS t(word)
JOIN   clients_words c USING (word);

The details of the query depend on requirements that the question leaves out. This version splits the string at space characters.
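To make the split-then-join approach concrete, here is a minimal sketch that runs against the sample rows from the question. The temporary table and its column types (int, text) are assumptions for illustration; the question does not state the real definitions.

```sql
-- Sample data copied from the question. The column types are assumed,
-- and a temporary table keeps the sketch self-contained.
CREATE TEMP TABLE clients_words (
    id        int  PRIMARY KEY,
    client_id int  NOT NULL,
    word      text NOT NULL
);

INSERT INTO clients_words (id, client_id, word) VALUES
    (1, 1242, 'word1'),
    (2, 1242, 'WordX.foo'),
    (3, 1372, 'nextword'),
    (4, 1999, 'word1');

-- Split the query string at spaces, then look each word up by equality.
SELECT DISTINCT c.client_id
FROM   unnest(string_to_array('Some people tell word1 to someone', ' ')) AS t(word)
JOIN   clients_words c USING (word)
ORDER  BY c.client_id;
-- Returns client_id 1242 and 1999
```

Note that this finds whole-word matches only. For the second query string from the question, 'Another stringWordX.foo too possible', the split yields the token 'stringWordX.foo', which is not equal to the stored 'WordX.foo', so no row is returned.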

A more flexible tool would be regexp_split_to_table(), where you can use character classes or class shorthands for your delimiter characters. Like:

regexp_split_to_table('Some people tell word1 to someone', '\s') AS t(word)
regexp_split_to_table('Some people tell word1 to someone', '\W') AS t(word)
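The choice of delimiter pattern decides which tokens reach the join. The sketch below only lists the tokens produced for one of the question's query strings; the `+` quantifier is my addition so that runs of delimiters do not produce empty strings.

```sql
-- Split at runs of whitespace: punctuation stays inside the tokens.
SELECT word
FROM   regexp_split_to_table('Another stringWordX.foo too possible', '\s+') AS t(word);
-- Tokens: Another, stringWordX.foo, too, possible

-- Split at runs of non-word characters: the dot also acts as a delimiter.
SELECT word
FROM   regexp_split_to_table('Another stringWordX.foo too possible', '\W+') AS t(word);
-- Tokens: Another, stringWordX, foo, too, possible
```

With '\W+' a stored value such as 'WordX.foo' can never be matched by equality, because the dot is consumed as a delimiter. Which pattern fits depends on how the words in the table were produced.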

Related answer: Django. PostgreSQL. regexp_split_to_table not working
A search for more answers for regular expression class shorthands.

Of course the column clients_words.word needs to be indexed for performance:

CREATE INDEX clients_words_word_idx ON clients_words (word);

With this index in place, the lookup should be very fast.
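One way to check whether the index is actually used is to look at the query plan on a table of realistic size. The sketch below is only an illustration: the clients_words_demo table, its synthetic contents and the row count are hypothetical stand-ins for the real data.

```sql
-- Hypothetical test table with 200k synthetic rows, roughly the size
-- mentioned in the question.
CREATE TEMP TABLE clients_words_demo (
    id        serial PRIMARY KEY,
    client_id int  NOT NULL,
    word      text NOT NULL
);

INSERT INTO clients_words_demo (client_id, word)
SELECT (random() * 2000)::int, 'word' || g
FROM   generate_series(1, 200000) AS g;

CREATE INDEX clients_words_demo_word_idx ON clients_words_demo (word);

-- Refresh planner statistics so the plan reflects the new data.
ANALYZE clients_words_demo;

-- Show the plan the planner chooses for the split-then-join query.
EXPLAIN
SELECT DISTINCT c.client_id
FROM   unnest(string_to_array('Some people tell word1 to someone', ' ')) AS t(word)
JOIN   clients_words_demo c USING (word);
```

If the plan shows an index scan or bitmap scan on clients_words_demo_word_idx instead of a sequential scan, each word is being looked up through the index. The planner makes that decision from table statistics, so plans on a very small table may differ.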

If you want to ignore word boundaries altogether, the whole matter becomes much more expensive. LIKE / ILIKE in combination with a trigram GIN index would come to mind. Details here:
PostgreSQL LIKE query performance variations

Or other pattern-matching techniques - answer on dba.SE:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
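For reference, the LIKE / ILIKE plus trigram GIN index combination looks roughly like the sketch below. It assumes the clients_words table from the question exists and that the pg_trgm extension may be installed; the index name is made up for illustration.

```sql
-- Trigram support ships in the pg_trgm contrib extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN index over the trigrams of the word column (hypothetical index name).
CREATE INDEX clients_words_word_trgm_idx
    ON clients_words USING gin (word gin_trgm_ops);

-- The index can support this direction: a fixed pattern searched
-- inside the indexed column, regardless of word boundaries.
SELECT *
FROM   clients_words
WHERE  word ILIKE '%ord1%';
```

This covers the case where the pattern is a constant and the column is the text being searched, which is the opposite of the direction used in the question.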

However, your case is backwards: the column holds the patterns and the query string is the text being searched, so the index is not going to help. You would have to inspect every single row for a partial match, which makes queries very expensive. The superior approach is to reverse the operation: split the string into words first and then search for them.
