How do I do a fuzzy match of company names in MySQL with PHP for auto-complete?
Question
My users will import, by cut and paste, a large string that contains company names.
I have an existing and growing MySQL database of company names, each with a unique company_id.
I want to parse the string and assign a fuzzy match to each user-entered company name.
Right now, even a straight-up string match is slow. Will Soundex indexing be faster? How can I give users some options as they are typing?
For example, someone writes:
Microsoft -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc. -> Polycom
I have found the following threads that seem similar to this question, but the posters have not accepted an answer and I'm not sure whether their use case applies:
Recommended answer
You can start with SOUNDEX(); this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).
The drawbacks of SOUNDEX() are:
- its inability to differentiate longer strings. With standard four-character codes, only the first few consonants are taken into account, so longer strings that diverge at the end generate the same SOUNDEX value
- the fact that the first letter must be the same or you won't find a match easily. SQL Server has a DIFFERENCE() function to tell you how far apart two SOUNDEX values are, but I think MySQL has nothing of that kind built in
- for MySQL, at least according to the docs, SOUNDEX is broken for Unicode input
Example:
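SELECT SOUNDEX('Microsoft');
SELECT SOUNDEX('Microsift');
SELECT SOUNDEX('Microsift Corporation');
SELECT SOUNDEX('Microsift Subsidary');
/* With standard four-character Soundex codes, all of these return 'M262'.
   MySQL's SOUNDEX() can return a longer string; LEFT(SOUNDEX(...), 4)
   gives the standard four-character code. */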
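To make the collisions above concrete, here is a minimal sketch of standard four-character Soundex. It is written in Python purely for illustration and is not MySQL's own implementation; the function name `soundex` and the table `CODES` are names chosen for this example.

```python
# Illustrative sketch of standard four-character American Soundex.
# Assumption: this mirrors the classic algorithm, not MySQL's SOUNDEX(),
# which may return codes longer than four characters.

# Letter-to-digit table: consonants that sound alike share a digit.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(text: str) -> str:
    """Return the four-character Soundex code of `text` ('' if no A-Z letters)."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]                      # the first letter is kept verbatim
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:    # skip vowels and repeated codes
            code += digit
        if ch not in "HW":                 # H and W do not reset the repeat check
            previous = digit
        if len(code) == 4:                 # everything after this point is ignored
            break
    return code.ljust(4, "0")              # pad short codes with zeros


for name in ["Microsoft", "Microsift Corporation", "Bare Essentials",
             "Bare Escentuals", "Nzcrosoft"]:
    print(name, soundex(name))
```

Because the loop stops after four characters, anything that follows the first three coded consonants has no effect, and a typo in the first letter changes the whole code.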
For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (and slower) solution, but it allows for greater flexibility.
The main drawback is that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX value in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
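The threshold idea can be sketched as follows. This is a Python illustration under stated assumptions, not a MySQL stored function: `suggest`, `max_distance`, and the in-memory `companies` mapping of company_id to name are hypothetical stand-ins for a query against the real table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a                         # keep the shorter string as the row
    previous = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(user_input, companies, max_distance=3):
    """Return (distance, company_id, name) tuples within the threshold, closest first.

    `companies` is a hypothetical dict of company_id -> name; every name is
    compared against the input, which is why this approach is slow at scale.
    """
    needle = user_input.casefold()
    matches = []
    for company_id, name in companies.items():
        distance = levenshtein(needle, name.casefold())
        if distance <= max_distance:
            matches.append((distance, company_id, name))
    return sorted(matches)


companies = {1: "Microsoft", 2: "Bare Escentuals", 3: "Polycom"}
print(suggest("Bare Essentials", companies))
print(suggest("Nzcrosoft", companies))
```

Note that "Polycom, Inc." is six edits away from "Polycom", so a small threshold alone would miss it; stripping suffixes such as ", Inc." before comparing is one possible preprocessing step.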
In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).