What is the best way to implement a substring search in SQL?


Problem description

We have a simple SQL problem here. In a varchar column, we want to search for a string anywhere in the field. What is the best way to implement this for performance? Obviously an ordinary index is not going to help here. Are there any other tricks?

We are using MySQL and have about 3 million records. We need to execute many of these queries per second, so we are really trying to implement them with the best performance.

The simplest way to do this so far is:

Select * from table where column like '%search%'


I should further specify that the column is actually a long string like "sadfasdfwerwe" and I have to search for "asdf" in this column. So these are not sentences, and I am not trying to match a word in them. Would full text search still help here?

Solution

Check out my presentation Practical Fulltext Search in MySQL.

I compared:

Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.


Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries.

The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.
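
To make the idea concrete, here is a minimal Python sketch of trigram extraction. The helper name `trigrams` is made up for illustration and is not part of MySQL or any library. It also shows the property the approach relies on: a string can only contain a pattern if it contains every trigram of that pattern.

```python
def trigrams(text, n=3):
    """Return the set of all length-n substrings of text (hypothetical helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

column_value = "sadfasdfwerwe"  # example value from the question
pattern = "asdf"

# The pattern breaks down into its overlapping three-letter sequences.
print(sorted(trigrams(pattern)))

# A necessary (not sufficient) condition for a match:
# every trigram of the pattern also occurs in the column value.
print(trigrams(pattern) <= trigrams(column_value))
```

Strings shorter than three characters yield no trigrams at all, so a pattern that short cannot be looked up in a trigram index.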

I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:

create table trigrams (
  trigram char(3) primary key
);

create table trigram_matches (
  trigram char(3),
  document_id int,
  primary key (trigram, document_id),
  foreign key (trigram) references trigrams(trigram),
  foreign key (document_id) references mytable(document_id)
);

Now populate it the hard way:

insert into trigram_matches
  select t.trigram, d.document_id
  from trigrams t join mytable d
    on d.textcolumn like concat('%', t.trigram, '%');
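
As an alternative to the join above, the matches table could be filled from application code by sliding a three-character window over each row, which avoids testing every possible trigram against every row. This is only a sketch under assumptions: SQLite (via Python's `sqlite3`) stands in for MySQL so the example is self-contained, the sample rows are made up, the `trigrams` catalogue table and its foreign key are left out because this route does not need them, and `insert or ignore` is SQLite syntax (MySQL spells it `insert ignore`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite as a stand-in for MySQL
conn.executescript("""
    create table mytable (document_id integer primary key, textcolumn text);
    create table trigram_matches (
        trigram char(3),
        document_id int,
        primary key (trigram, document_id)
    );
""")
conn.executemany(
    "insert into mytable (document_id, textcolumn) values (?, ?)",
    [(1, "sadfasdfwerwe"), (2, "qwertyqwerty")],  # made-up sample rows
)

def index_document(conn, document_id, text):
    """Insert one row per distinct trigram that occurs in text (hypothetical helper)."""
    grams = {text[i:i + 3] for i in range(len(text) - 2)}
    conn.executemany(
        "insert or ignore into trigram_matches (trigram, document_id) values (?, ?)",
        [(g, document_id) for g in sorted(grams)],
    )

# Read all rows first, then index them one by one.
for document_id, text in conn.execute(
    "select document_id, textcolumn from mytable"
).fetchall():
    index_document(conn, document_id, text)

rows = conn.execute(
    "select document_id from trigram_matches where trigram = 'asd'"
).fetchall()
print(rows)
```

The same helper could be called whenever a row is inserted or updated, so the index stays current without a full rebuild.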

Of course this will take quite a while! But once it's done, you can search much more quickly:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc';

Of course you could be searching for patterns longer than three characters, but the inverted index still helps to narrow your search a lot:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc'
  and d.textcolumn like '%abcdef%';
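
The query above narrows by a single trigram. One possible refinement, sketched here in Python with an in-memory index rather than SQL, is to require every trigram of the pattern before running the final substring check. The function names and sample data are made up for illustration. The final check is still needed because the trigrams can all be present without being adjacent.

```python
from collections import defaultdict

def trigrams(text):
    """All distinct three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(documents):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for document_id, text in documents.items():
        for gram in trigrams(text):
            index[gram].add(document_id)
    return index

def search(pattern, documents, index):
    grams = trigrams(pattern)
    if not grams:
        # Pattern shorter than 3 characters: the index cannot help, scan everything.
        candidates = set(documents)
    else:
        # Only documents containing every trigram of the pattern can match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify each candidate, as the LIKE clause does in the SQL version.
    return sorted(d for d in candidates if pattern in documents[d])

documents = {1: "xxabcdefxx", 2: "abcxxdefxx", 3: "cdefabcd"}  # made-up sample data
index = build_index(documents)
print(search("abcdef", documents, index))
```

Document 3 contains all four trigrams of `abcdef` but not the pattern itself, which is why the verification step cannot be skipped.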
