php(模糊)搜索匹配 [英] php (fuzzy) search matching

查看:676
本文介绍了php(模糊)搜索匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果任何人曾经提交过一个故事给digg,它会检查故事是否已经提交,我假设通过模糊搜索。



我想实现类似的东西,想知道他们是否使用了开源的php类?



Soundex没有这样做,句子/字符串可以长达250个字节。

但是,您当然可以将算法应用于小型数据集。

要特别说明如何创建服务器崩溃:一对内置PHP函数将决定字符串之间的距离: levenshtein similar_text
$ b 虚拟数据:(假装他们是新闻标题)

 $ titles =<<<< EOF 
Apple
苹果
橙色
橙色
香蕉
EOF;

$ titles =爆炸(\ n,$ titles);



此时, $ titles 一串字符串。现在,创建一个矩阵,并将每个标题与每个其他标题的相似度进行比较。换句话说,对于5条标题,您将得到一个5 x 5的矩阵(25个条目)。这就是CPU和内存接收器进入的地方。

这就是为什么这种方法(通过PHP)不能应用于数千个条目。但如果你想:

 $ matches = array(); 
foreach($ title为$ title){
$ matches [$ title] = array();
foreach($ title为$ compare_to){
$ matches [$ title] [$ compare_to] = levenshtein($ compare_to,$ title);
}
asort($ matches [$ title],SORT_NUMERIC);
}

在这一点上你基本上有一个带有文本距离的矩阵。在概念上(不是真实的数据),它看起来有点像这张表。注意有一组0对角线的值 - 这意味着在匹配循环中,两个相同的单词是 - 好的,相同的。

 
苹果苹果橙子香蕉
苹果0 1 5 6 6
苹果1 0 6 5 6
橙色5 6 0 1 5
橙子6 5 1 0 5
Banana 6 6 5 5 0

实际的$ matches数组看起来有点像这样(截断):数组

[Apple] =>数组

[Apple] => 0 $ b $(




[苹果] => 1
[橙色] => 5
[香蕉] => 6
[橙子] => 6


[Apples] => Array

...

无论如何,这取决于您(通过实验)确定一个好的数字距离截止值可能大部分匹配 - 然后应用它erwise,阅读sphinx-search并使用它 - 因为它有PHP库。



你很高兴你问这个橙子吗?


if anyone has ever submitted a story to digg, it checks whether or not the story is already submitted, I assume by a fuzzy search.

I would like to implement something similar and want to know if they are using a php class that is open source?

Soundex isnt doing it, sentences/strings can be up to 250chars in length

解决方案

Unfortunately, doing this in PHP is prohibitively expensive (high CPU and memory utilization.) However, you can certainly apply the algorithm to small data sets.

To specifically expand on how you can create a server meltdown: couple of built-in PHP functions will determine "distance" between strings: levenshtein and similar_text.

Dummy data: (pretend they're news headlines)

$titles = <<< EOF
Apple
Apples
Orange
Oranges
Banana
EOF;

$titles = explode("\n", $titles );

At this point, $titles should just be an array of strings. Now, create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines, you will get a 5 x 5 matrix (25 entries.) That's where the CPU and memory sink goes in.

That's why this method (via PHP) can't be applied to thousands of entries. But if you wanted to:

$matches = array();
foreach( $titles as $title ) {
    $matches[$title] = array();
    foreach( $titles as $compare_to ) {
        $matches[$title][$compare_to] = levenshtein( $compare_to, $title );
    }
    asort( $matches[$title], SORT_NUMERIC  );
}

At this point what you basically have is a matrix with "text distances." In concept (not in real data) it looks sort of like this table below. Note how there is a set of 0 values that go diagonally - that means that in the matching loop, two identical words are -- well, identical.

       Apple Apples Orange Oranges Banana
Apple    0     1      5      6       6
Apples   1     0      6      5       6
Orange   5     6      0      1       5
Oranges  6     5      1      0       5
Banana   6     6      5      5       0

The actual $matches array looks sort of like this (truncated):

Array
(
    [Apple] => Array
        (
            [Apple] => 0
            [Apples] => 1
            [Orange] => 5
            [Banana] => 6
            [Oranges] => 6
        )

    [Apples] => Array
        (
      ...

Anyhow, it's up to you to (by experimentation) determine what a good numerical distance cutoff might mostly match - and then apply it. Otherwise, read up on sphinx-search and use it - since it does have PHP libraries.

Orange you glad you asked about this?

这篇关于php(模糊)搜索匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆