PHP中的字符串相似性:类似于levenshtein的长字符串函数 [英] String similarity in PHP: levenshtein like function for long strings

查看:83
本文介绍了PHP中的字符串相似性:类似于levenshtein的长字符串函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

PHP中的函数levenshtein适用于最大长度为255的字符串.在PHP中计算句子的相似性分数的最佳选择是什么.

The function levenshtein in PHP works on strings with maximum length 255. What are good alternatives to compute a similarity score of sentences in PHP.

基本上,我有一个句子数据库,我想找到近似的重复项. similar_text函数没有给我预期的结果.对我来说,检测如下类似句子的最简单方法是什么:

Basically I have a database of sentences, and I want to find approximate duplicates. similar_text function is not giving me expected results. What is the easiest way for me to detect similar sentences like below:

$ss="Jack is a very nice boy, isn't he?";
$pp="jack is a very nice boy is he";

$ss=strtolower($ss);  // convert to lower case as we dont care about case
$pp=strtolower($pp);

$score=similar_text($ss, $pp);
echo "$score %\n";  // Outputs just 29 %

$score=levenshtein ( $ss, $pp );
echo "$score\n";  // Outputs '5', which indicates they are very similar. But, it does not work for more than 255 chars :(

推荐答案

levenshtein算法的时间复杂度为O(n*m),其中nm是两个输入字符串的长度.这是非常昂贵的,计算长字符串的距离将花费很长时间.

The levenshtein algorithm has a time complexity of O(n*m), where n and m are the lengths of the two input strings. This is pretty expensive and computing such a distance for long strings will take a long time.

对于整个句子,您可能想使用diff算法,例如,请参见:

For whole sentences, you might want to use a diff algorithm instead, see for example: Highlight the difference between two strings in PHP

话虽如此,PHP还提供了 similar_text 函数的复杂度更高(O(max(n,m)**3)),但似乎可以在更长的字符串上使用.

Having said this, PHP also provides the similar_text function which has an even worse complexity (O(max(n,m)**3)) but seems to work on longer strings.

这篇关于PHP中的字符串相似性:类似于levenshtein的长字符串函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆