使用SQLServer包含部分单词 [英] Using SQLServer contains for partial words

查看:237
本文介绍了使用SQLServer包含部分单词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们正在使用部分匹配的条形码在庞大的目录中搜索许多产品.

We are running many products search on a huge catalog with partially matched barcodes.

我们从一个简单的查询开始

We started with a simple like query

select * from products where barcode like '%2345%'

但这花费了太长时间,因为它需要全表扫描. 我们认为全文搜索将可以使用contains来帮助我们.

But that takes way too long since it requires a full table scan. We thought a fulltext search will be able to help us here using contains.

select * from products where contains(barcode, '2345')

但是,contains似乎不支持查找部分包含文本的单词,而只能查找完整的单词匹配项或前缀. (但在此示例中,我们正在寻找"123456").

But, it seems like contains doesn't support finding words that partially contains a text but, only full a word match or a prefix. (But in this example we're looking for '123456').

推荐答案

我的答案是:@DenisReznik是正确的:)

My answer is: @DenisReznik was right :)

好的,让我们看一下.
我处理条形码和大型目录已有很多年了,对此感到很好奇.

ok, let's take a look.
I have worked with barcodes and big catalogs for many years and I was curious about this question.

所以我自己做了一些测试.

So I have made some tests on my own.

我已经创建了一个表来存储测试数据:

I have created a table to store test data:

CREATE TABLE [like_test](
    [N] [int] NOT NULL PRIMARY KEY,
    [barcode] [varchar](40) NULL
) 

我知道条形码的类型很多,有些仅包含数字,另一些还包含字母,而其他甚至可能非常复杂.

I know that there are many types of barcodes, some contains only numbers, other contains also letters, and other can be even much complex.

假设我们的条形码是随机字符串.
我已经填充了1000万条随机字母数字数据记录:

Let's assume our barcode is a random string.
I have filled it with 10 millions records of random alfanumeric data:

insert into like_test
select (select count(*) from like_test)+n, REPLACE(convert(varchar(40), NEWID()), '-', '') barcode 
from FN_NUMBERS(10000000)

FN_NUMBERS()只是我在数据库中使用的功能(一种tally_table) 快速获取记录.

FN_NUMBERS() is just a function I use in my DBs (a sort of tally_table) to get records quick.

我有1000万条这样的记录:

I got 10 million records like that:

N   barcode
1   1C333262C2D74E11B688281636FAF0FB
2   3680E11436FC4CBA826E684C0E96E365
3   7763D29BD09F48C58232C7D33551E6C9

让我们声明一个要搜索的变量:

Let's declare a var to search for:

declare @s varchar(20) = 'D34F15' -- random alfanumeric string 

让我们对 Like 进行基础尝试,将结果与以下内容进行比较:

Let's take a base try with LIKE to compare results to:

select * from like_test where barcode like '%'+@s+'%'

在我的工作站上,进行完整的聚集索引扫描需要24.4秒.非常慢.

On my workstation it takes 24.4 secs for a full clustered index scan. Very slow.

SSMS建议在条形码列上添加索引:

SSMS suggests to add an index on barcode column:

CREATE NONCLUSTERED INDEX [ix_barcode] ON [like_test] ([barcode]) INCLUDE ([N])

500Mb索引,我重试选择,这次非聚集索引搜寻的时间为24.0秒.小于2%,结果几乎相同.与SSMS假设的75%相差很远.在我看来,这个指数确实不值得.也许我的SSD三星840正在发挥作用..
目前,我让索引处于活动状态.

500Mb of index, I retry the select, this time 24.0 secs for the non clustered index seek.. less than 2% better, almost the same result. Very far from the 75% supposed by SSMS. It seems to me this index really doesn't worth it. Maybe my SSD Samsung 840 is making the difference..
For the moment I let the index active.

让我们尝试 CHARINDEX 解决方案:

select * from like_test where charindex(@s, barcode) > 0

这一次花了23.5秒完成,并不比LIKE好多少.

This time it took 23.5 second to complete, not really so much better than LIKE.

现在,让我们检查@DenisReznik的建议,即使用 Binary Collat​​ion 可以加快处理速度.

Now let's check the @DenisReznik 's suggestion that using the Binary Collation should speed up things.

select * from like_test
where barcode collate Latin1_General_BIN like '%'+@s+'%' collate Latin1_General_BIN 

哇,它似乎起作用了!仅4.5秒,这令人印象深刻!好5倍..
那么,CHARINDEX和Collat​​ion在一起又如何呢?试试吧:

WOW, it seems to work! Only 4.5 secs this is impressive! 5 times better..
So, what about CHARINDEX and Collation toghether? Let's try it:

select * from like_test
where charindex(@s collate Latin1_General_BIN, barcode collate Latin1_General_BIN)>0

难以置信! 2.4秒,好10倍.

Unbelivable! 2.4 secs, 10 times better..

好的,到目前为止,我已经意识到CHARINDEX比LIKE好,并且Binary Collat​​ion比普通的字符串排序法好,因此从现在开始,我将仅继续使用CHARINDEX和Collat​​ion.

Ok, so far I have realized that CHARINDEX is better than LIKE, and that Binary Collation is better than normal string collation, so from now on I will go on only with CHARINDEX and Collation.

现在,我们还能做些其他事情以获得更好的结果吗?也许我们可以尝试减少很长的字符串..扫描始终是扫描..

Now, can we do anything else to get even better results? Maybe we can try reduce our very long strings.. a scan is always a scan..

第一次尝试,使用SUBSTRING进行逻辑字符串切割,实际上可以处理8个字符的条形码:

First try, a logical string cut using SUBSTRING to virtually works on barcodes of 8 chars:

select * from like_test
where charindex(
        @s collate Latin1_General_BIN, 
        SUBSTRING(barcode, 12, 8) collate Latin1_General_BIN
      )>0

太棒了! 1.8秒..我尝试了SUBSTRING(barcode, 1, 8)(字符串的头部)和SUBSTRING(barcode, 12, 8)(字符串的中央)都具有相同的结果.

Fantastic! 1.8 seconds.. I have tried both SUBSTRING(barcode, 1, 8) (head of the string) and SUBSTRING(barcode, 12, 8) (middle of the string) with same results.

然后,我尝试以物理方式减小条形码列的大小,与使用SUBSTRING()几乎没有区别

Then I have tried to phisically reduce the size of the barcode column, almost no difference than using SUBSTRING()

最后,我尝试删除条形码列上的索引并重复上述所有测试... 我惊讶地得到几乎相同的结果,差异很小.
Index的性能提高了3-5%,但是如果更新目录,则需要500Mb的磁盘空间和维护成本.

Finally I have tried to drop the index on barcode column and repeated ALL above tests... I was very surprised to get almost same results, with very little differences.
Index performs 3-5% better, but at cost of 500Mb of disk space and and maintenance cost if the catalog is updated.

自然地,对于像where barcode = @s这样的直接键查找,其索引需要20-50毫秒,如果没有索引,则使用归类语法where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN

Naturally, for a direct key lookup like where barcode = @s with the index it takes 20-50 millisecs, without index we can't get less than 1.1 secs using Collation syntax where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN

这很有趣.
希望对您有帮助

This was interesting.
I hope this helps

这篇关于使用SQLServer包含部分单词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆