使用SQLServer包含部分单词 [英] Using SQLServer contains for partial words
问题描述
我们正在使用部分匹配的条形码在庞大的目录中搜索许多产品.
We are running many products search on a huge catalog with partially matched barcodes.
我们从一个简单的查询开始
We started with a simple like query
select * from products where barcode like '%2345%'
但这花费了太长时间,因为它需要全表扫描. 我们认为全文搜索将可以使用contains来帮助我们.
But that takes way too long since it requires a full table scan. We thought a fulltext search will be able to help us here using contains.
select * from products where contains(barcode, '2345')
但是,contains似乎不支持查找部分包含文本的单词,而只能查找完整的单词匹配项或前缀. (但在此示例中,我们正在寻找"123456").
But, it seems like contains doesn't support finding words that partially contains a text but, only full a word match or a prefix. (But in this example we're looking for '123456').
推荐答案
我的答案是:@DenisReznik是正确的:)
My answer is: @DenisReznik was right :)
好的,让我们看一下.
我处理条形码和大型目录已有很多年了,对此感到很好奇.
ok, let's take a look.
I have worked with barcodes and big catalogs for many years and I was curious about this question.
所以我自己做了一些测试.
So I have made some tests on my own.
我已经创建了一个表来存储测试数据:
I have created a table to store test data:
CREATE TABLE [like_test](
[N] [int] NOT NULL PRIMARY KEY,
[barcode] [varchar](40) NULL
)
我知道条形码的类型很多,有些仅包含数字,另一些还包含字母,而其他甚至可能非常复杂.
I know that there are many types of barcodes, some contains only numbers, other contains also letters, and other can be even much complex.
假设我们的条形码是随机字符串.
我已经填充了1000万条随机字母数字数据记录:
Let's assume our barcode is a random string.
I have filled it with 10 millions records of random alfanumeric data:
insert into like_test
select (select count(*) from like_test)+n, REPLACE(convert(varchar(40), NEWID()), '-', '') barcode
from FN_NUMBERS(10000000)
FN_NUMBERS()只是我在数据库中使用的功能(一种tally_table) 快速获取记录.
FN_NUMBERS() is just a function I use in my DBs (a sort of tally_table) to get records quick.
我有1000万条这样的记录:
I got 10 million records like that:
N barcode
1 1C333262C2D74E11B688281636FAF0FB
2 3680E11436FC4CBA826E684C0E96E365
3 7763D29BD09F48C58232C7D33551E6C9
让我们声明一个要搜索的变量:
Let's declare a var to search for:
declare @s varchar(20) = 'D34F15' -- random alfanumeric string
让我们对 Like 进行基础尝试,将结果与以下内容进行比较:
Let's take a base try with LIKE to compare results to:
select * from like_test where barcode like '%'+@s+'%'
在我的工作站上,进行完整的聚集索引扫描需要24.4秒.非常慢.
On my workstation it takes 24.4 secs for a full clustered index scan. Very slow.
SSMS建议在条形码列上添加索引:
SSMS suggests to add an index on barcode column:
CREATE NONCLUSTERED INDEX [ix_barcode] ON [like_test] ([barcode]) INCLUDE ([N])
500Mb索引,我重试选择,这次非聚集索引搜寻的时间为24.0秒.小于2%,结果几乎相同.与SSMS假设的75%相差很远.在我看来,这个指数确实不值得.也许我的SSD三星840正在发挥作用..
目前,我让索引处于活动状态.
500Mb of index, I retry the select, this time 24.0 secs for the non clustered index seek.. less than 2% better, almost the same result. Very far from the 75% supposed by SSMS. It seems to me this index really doesn't worth it. Maybe my SSD Samsung 840 is making the difference..
For the moment I let the index active.
让我们尝试 CHARINDEX 解决方案:
select * from like_test where charindex(@s, barcode) > 0
这一次花了23.5秒完成,并不比LIKE好多少.
This time it took 23.5 second to complete, not really so much better than LIKE.
现在,让我们检查@DenisReznik的建议,即使用 Binary Collation 可以加快处理速度.
Now let's check the @DenisReznik 's suggestion that using the Binary Collation should speed up things.
select * from like_test
where barcode collate Latin1_General_BIN like '%'+@s+'%' collate Latin1_General_BIN
哇,它似乎起作用了!仅4.5秒,这令人印象深刻!好5倍..
那么,CHARINDEX和Collation在一起又如何呢?试试吧:
WOW, it seems to work! Only 4.5 secs this is impressive! 5 times better..
So, what about CHARINDEX and Collation toghether? Let's try it:
select * from like_test
where charindex(@s collate Latin1_General_BIN, barcode collate Latin1_General_BIN)>0
难以置信! 2.4秒,好10倍.
Unbelivable! 2.4 secs, 10 times better..
好的,到目前为止,我已经意识到CHARINDEX比LIKE好,并且Binary Collation比普通的字符串排序法好,因此从现在开始,我将仅继续使用CHARINDEX和Collation.
Ok, so far I have realized that CHARINDEX is better than LIKE, and that Binary Collation is better than normal string collation, so from now on I will go on only with CHARINDEX and Collation.
现在,我们还能做些其他事情以获得更好的结果吗?也许我们可以尝试减少很长的字符串..扫描始终是扫描..
Now, can we do anything else to get even better results? Maybe we can try reduce our very long strings.. a scan is always a scan..
第一次尝试,使用SUBSTRING进行逻辑字符串切割,实际上可以处理8个字符的条形码:
First try, a logical string cut using SUBSTRING to virtually works on barcodes of 8 chars:
select * from like_test
where charindex(
@s collate Latin1_General_BIN,
SUBSTRING(barcode, 12, 8) collate Latin1_General_BIN
)>0
太棒了! 1.8秒..我尝试了SUBSTRING(barcode, 1, 8)
(字符串的头部)和SUBSTRING(barcode, 12, 8)
(字符串的中央)都具有相同的结果.
Fantastic! 1.8 seconds.. I have tried both SUBSTRING(barcode, 1, 8)
(head of the string) and SUBSTRING(barcode, 12, 8)
(middle of the string) with same results.
然后,我尝试以物理方式减小条形码列的大小,与使用SUBSTRING()几乎没有区别
Then I have tried to phisically reduce the size of the barcode column, almost no difference than using SUBSTRING()
最后,我尝试删除条形码列上的索引并重复上述所有测试...
我惊讶地得到几乎相同的结果,差异很小.
Index的性能提高了3-5%,但是如果更新目录,则需要500Mb的磁盘空间和维护成本.
Finally I have tried to drop the index on barcode column and repeated ALL above tests...
I was very surprised to get almost same results, with very little differences.
Index performs 3-5% better, but at cost of 500Mb of disk space and and maintenance cost if the catalog is updated.
自然地,对于像where barcode = @s
这样的直接键查找,其索引需要20-50毫秒,如果没有索引,则使用归类语法where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN
Naturally, for a direct key lookup like where barcode = @s
with the index it takes 20-50 millisecs, without index we can't get less than 1.1 secs using Collation syntax where barcode collate Latin1_General_BIN = @s collate Latin1_General_BIN
这很有趣.
希望对您有帮助
This was interesting.
I hope this helps
这篇关于使用SQLServer包含部分单词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!