PHP (fuzzy) search matching
If anyone has ever submitted a story to digg, you know it checks whether or not the story has already been submitted, I assume by a fuzzy search.
I would like to implement something similar and want to know whether they are using an open-source PHP class.
Soundex isn't doing it; sentences/strings can be up to 250 characters in length.
Unfortunately, doing this in PHP is prohibitively expensive (high CPU and memory utilization). However, you can certainly apply the algorithm to small data sets.
To expand on how you can create a server meltdown: a couple of built-in PHP functions, levenshtein and similar_text, will determine the "distance" between strings.
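As a quick illustration of what those two functions return, here is a minimal sketch on a single pair of strings. The variable names are made up for this example. Note that levenshtein counts edits (lower means closer) while similar_text counts matching characters (higher means closer), so the two scores run in opposite directions.

```php
<?php
// Compare two short example strings with both built-in functions.
$a = 'Apple';
$b = 'Apples';

// levenshtein() returns the number of single-character edits
// (insert, delete, replace) needed to turn $a into $b.
$distance = levenshtein($a, $b);

// similar_text() returns the number of matching characters and,
// through the optional third argument, a similarity percentage.
$common = similar_text($a, $b, $percent);

printf("distance=%d common=%d percent=%.1f\n", $distance, $common, $percent);
```

Both functions work on bytes, not characters, so multibyte (e.g. UTF-8) titles may score differently than you expect. Also, before PHP 8.0, levenshtein returned -1 for strings longer than 255 characters; the 250-character limit in the question fits under that.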
Dummy data: (pretend they're news headlines)
$titles = <<<EOF
Apple
Apples
Orange
Oranges
Banana
EOF;
$titles = explode("\n", $titles);
At this point, $titles should just be an array of strings. Now, create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines you will get a 5 x 5 matrix (25 entries). That's where the CPU and memory go.
That's why this method (via PHP) can't be applied to thousands of entries. But if you wanted to:
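$matches = array();
foreach ($titles as $title) {
    $matches[$title] = array();
    foreach ($titles as $compare_to) {
        // Edit distance between this title and every title (including itself).
        $matches[$title][$compare_to] = levenshtein($compare_to, $title);
    }
    // Sort each row so the closest titles come first, keeping the keys.
    asort($matches[$title], SORT_NUMERIC);
}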
At this point what you basically have is a matrix of "text distances." In concept (not in real data) it looks sort of like the table below. Note the set of 0 values running diagonally: in the matching loop each title is also compared with itself, and two identical strings are, well, identical.
          Apple  Apples  Orange  Oranges  Banana
Apple       0      1       5       6        6
Apples      1      0       6       5        6
Orange      5      6       0       1        5
Oranges     6      5       1       0        5
Banana      6      6       5       5        0
The actual $matches array looks sort of like this (truncated):
Array
(
    [Apple] => Array
        (
            [Apple] => 0
            [Apples] => 1
            [Orange] => 5
            [Banana] => 6
            [Oranges] => 6
        )

    [Apples] => Array
        (
            ...
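The full matrix stores every pair twice, once in each direction. With its default costs, levenshtein gives the same result in both directions, so one possible saving is to compute each pair once and mirror it. The following is only a sketch: distance_matrix is a hypothetical helper name, and it assumes the titles are unique, since they are used as array keys.

```php
<?php
// Hypothetical helper: build the distance matrix while calling
// levenshtein() only once per unordered pair of titles.
function distance_matrix(array $titles): array
{
    $titles = array_values($titles); // ensure 0..n-1 integer keys
    $n = count($titles);
    $matrix = [];
    for ($i = 0; $i < $n; $i++) {
        // A title compared with itself always has distance 0.
        $matrix[$titles[$i]][$titles[$i]] = 0;
        for ($j = $i + 1; $j < $n; $j++) {
            $d = levenshtein($titles[$i], $titles[$j]);
            // Mirror the value so both directions are filled in.
            $matrix[$titles[$i]][$titles[$j]] = $d;
            $matrix[$titles[$j]][$titles[$i]] = $d;
        }
    }
    return $matrix;
}

$matrix = distance_matrix(['Apple', 'Apples', 'Orange', 'Oranges', 'Banana']);
```

For 5 titles this makes 10 levenshtein calls instead of 25. The growth is still quadratic, so it does not change the conclusion about thousands of entries.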
Anyhow, it's up to you to determine (by experimentation) a numerical distance cutoff that catches most real duplicates, and then apply it. Otherwise, read up on sphinx-search and use it, since it does have PHP libraries.
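For the original use case (checking one new submission against existing titles) you do not need the whole matrix, only one row of it. Here is a minimal sketch of applying a cutoff that way. The function name find_similar_titles, the lowercasing step, and the cutoff of 2 are all assumptions for illustration, not anything digg is known to do.

```php
<?php
// Hypothetical helper: return the existing titles that are within
// $maxDistance edits of $candidate, closest first.
function find_similar_titles(string $candidate, array $titles, int $maxDistance): array
{
    $hits = [];
    foreach ($titles as $title) {
        // Lowercase both sides so "apple" and "Apple" count as identical.
        $d = levenshtein(strtolower($candidate), strtolower($title));
        if ($d <= $maxDistance) {
            $hits[$title] = $d;
        }
    }
    asort($hits, SORT_NUMERIC); // smallest distance first, keys preserved
    return $hits;
}

$existing = ['Apple', 'Apples', 'Orange', 'Oranges', 'Banana'];
$similar = find_similar_titles('apple', $existing, 2);
print_r($similar);
```

With the sample data this keeps "Apple" (distance 0) and "Apples" (distance 1) and drops the rest. For real headlines of up to 250 characters a fixed cutoff of 2 is probably too strict; a cutoff that scales with title length may work better, but that needs the experimentation mentioned above.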
Orange you glad you asked about this?