查找包含相似字符串的sql记录 [英] Find sql records containing similar strings

查看:54
本文介绍了查找包含相似字符串的sql记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含 2 列的下表:包含超过 500.000 条记录的 ID 和标题.例如:

I have the following table with 2 columns: ID and Title containing over 500.000 records. For example:

ID  Title
--  ------------------------
1   Aliens
2   Aliens (1986)
3   Aliens vs Predator
4   Aliens 2
5   The making of "Aliens"

我需要找到非常相似的记录,我的意思是它们相差 3-6 个字母,通常这种区别在标题的末尾.所以我必须设计一个返回记录号的查询.1,2 和 4.我已经看过 levenstein 距离,但我不知道如何应用它.同样由于记录的数量,查询不应该整夜进行.

I need to find records that are very similar, and by that I mean they are different by 3-6 letters, usually this difference is at the end of the Titles. So I have to design a query that returns the records no. 1,2 and 4. I already looked at levenstein distance but I don't know how to apply it. Also because of the number of records the query shouldn't take all night long.

感谢您的任何想法或建议

Thanks for any idea or suggestion

推荐答案

如果您真的想以您在问题中表述的确切方式定义相似性,那么您将 - 正如您所说 - 必须实施 Levensthein Distance计算.在 DataReader 检索的每一行上计算的代码中或作为 SQL Server 函数.

If you really want to define similarity in the exact way that you have formulated in your question, then you would - as you say - have to implement the Levensthein Distance calculation. Either in code calculated on each row retrieved by a DataReader or as a SQL Server function.

所陈述的问题实际上比乍一看更棘手,因为您无法假设知道两个字符串之间的相互共享元素可能是什么.

The problem stated is actually more tricky than it may appear at first sight, because you cannot assume to know what the mutually shared elements between two strings may be.

因此,除了 Levensthein 距离之外,您可能还想指定实际必须匹配的最少连续字符数(以便得出足够的相似性).

So in addition to Levensthein Distance you probably also want to specify a minimum number of consecutive characters that actually have to match (in order for sufficient similarity to be concluded).

总而言之:这听起来过于复杂且耗时/缓慢.

In sum: It sounds like an overly complicated and time consuming/slow approach.

有趣的是,在 SQL Server 2008 中,您有 DIFFERENCE 函数可以用于类似的事情.

Interestingly, in SQL Server 2008 you have the DIFFERENCE function which may be used for something like this.

它评估两个字符串的语音值并计算差异.我不确定你是否能让它适用于多词表达,比如电影片名,因为它不能很好地处理空格或数字,并且过分强调字符串的开头,但它仍然很有趣要注意的谓词.

It evaluates the phonetic value of two strings and calculates the difference. I'm unsure if you will get it to work properly for multi-word expressions such as movie titles since it doesn't deal well with spaces or numbers and puts too much emphasis on the beginning of the string, but it is still an interesting predicate to be aware of.

如果您实际上想要描述的是某种搜索功能,那么您应该查看SQL Server 2008 的全文搜索功能.它提供内置的同义词支持,花式 SQL 谓词和最佳匹配"的排名机制

If what you are actually trying to describe is some sort of search feature, then you should look into the Full Text Search capabilities of SQL Server 2008. It provides built-in Thesaurus support, fancy SQL predicates and a ranking mechanism for "best matches"

如果您想消除重复项,也许您可​​以查看 SSIS Fuzzy查找和模糊组转换.我自己没有尝试过,但它看起来很有希望.

If you are looking to eliminate duplicates maybe you could look into SSIS Fuzzy Lookup and Fuzzy Group Transformation. I have not tried this myself, but it looks like a promising lead.

如果您不想深入研究 SSIS 并且仍然为 Levensthein 距离算法的性能而苦恼,您可以试试这个 算法 似乎不那么复杂.

If you don't want to dig into SSIS and still struggle with the performance of the Levensthein Distance algorithm, you could perhaps try this algorithm which appears to be less complex.

这篇关于查找包含相似字符串的sql记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆