Bag of words model: 2 PHP functions, same results: Why?

Problem description

I have two PHP functions that calculate the relation between two texts. Both use the bag of words model, but check2() is much faster. Even so, both functions give the same results. Why? check1() uses one big dictionary array containing ALL words from both texts, as described in the bag of words model. check2() doesn't use one big array but an array containing only the words of one text. So check2() shouldn't work, but it does. Why do both functions give the same results?

function check1($terms_in_article1, $terms_in_article2) {
    global $zeit_check1;
    $zeit_s = microtime(TRUE);
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $all_terms = array_merge($terms_in_article1, $terms_in_article2);
    $all_terms = array_unique($all_terms);
    foreach ($all_terms as $all_termsa) {
        $term_vector1[$all_termsa] = 0;
        $term_vector2[$all_termsa] = 0;
    }
    foreach ($terms_in_article1 as $terms_in_article1a) {
        $term_vector1[$terms_in_article1a]++;
    }
    foreach ($terms_in_article2 as $terms_in_article2a) {
        $term_vector2[$terms_in_article2a]++;
    }
    $score = 0;
    foreach ($all_terms as $all_termsa) {
        $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
    }
    $score = $score/($length1*$length2);
    $score *= 500; // for better readability
    $zeit_e = microtime(TRUE);
    $zeit_check1 += ($zeit_e-$zeit_s);
    return $score;
}
function check2($terms_in_article1, $terms_in_article2) {
    global $zeit_check2;
    $zeit_s = microtime(TRUE);
    $length1 = count($terms_in_article1); // number of words
    $length2 = count($terms_in_article2); // number of words
    $score_table = array();
    foreach($terms_in_article1 as $term){
        if(!isset($score_table[$term])) $score_table[$term] = 0;
        $score_table[$term] += 1;
    }
    $score_table2 = array();
    foreach($terms_in_article2 as $term){
        if(isset($score_table[$term])){
            if(!isset($score_table2[$term])) $score_table2[$term] = 0;
            $score_table2[$term] += 1;
        }
    }
    $score = 0;
    foreach($score_table2 as $key => $entry){
        $score += $score_table[$key] * $entry;
    }
    $score = $score/($length1*$length2);
    $score *= 500;
    $zeit_e = microtime(TRUE);
    $zeit_check2 += ($zeit_e-$zeit_s);
    return $score;
}

I hope you can help me. Thanks in advance!

Solution

Both functions implement pretty much the same algorithm, but while the first one does it in a straightforward way, the second one is a bit more clever and skips a portion of unnecessary work.

check1 goes like this:

// loop length(words1) times
for each word in words1:
    freq1[word]++

// loop length(words2) times
for each word in words2:
    freq2[word]++

// loop length(union(words1, words2)) times
for each word in union(words1, words2):
    score += freq1[word] * freq2[word]

But remember: when you multiply something by zero, you get zero.

This means that counting the frequencies of words that aren't in both texts is a waste of time: for such a word, one of the two frequencies is zero, so the product is zero and adds nothing to the score.
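
The same argument can be written as an identity. The notation is introduced here only for illustration and does not appear in the original code: f_1(w) and f_2(w) are the counts of word w in the first and second text, W_1 and W_2 are the sets of distinct words in each text, and n_1, n_2 are the text lengths ($length1 and $length2 in the functions above).

```latex
\begin{aligned}
\text{score}
&= \frac{500}{n_1 n_2} \sum_{w \in W_1 \cup W_2} f_1(w)\, f_2(w) \\
&= \frac{500}{n_1 n_2} \Bigl( \sum_{w \in W_1 \cap W_2} f_1(w)\, f_2(w)
   + \sum_{w \in (W_1 \cup W_2) \setminus (W_1 \cap W_2)}
     \underbrace{f_1(w)\, f_2(w)}_{=\,0} \Bigr) \\
&= \frac{500}{n_1 n_2} \sum_{w \in W_1 \cap W_2} f_1(w)\, f_2(w)
\end{aligned}
```

The first line is what check1 computes and the last line is what check2 computes, so the two results have to agree.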

check2 takes this into account:

// loop length(words1) times
for each word in words1:
    freq1[word]++

// loop length(words2) times
for each word in words2:
    if freq1[word] > 0:
        freq2[word]++

// loop length(intersection(words1, words2)) times
for each word in freq2:
    score += freq1[word] * freq2[word]
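
To see the equivalence in practice, here is a minimal sketch that strips the timing code from both functions and compares their results on a small sample. The names score_union and score_intersection and the sample word lists are hypothetical, chosen only for this illustration. The sketch assumes PHP 7 or later, string terms, and non-empty inputs (an empty text would cause a division by zero, as in the original functions).

```php
<?php
// Simplified sketch of check1(): counts every word of both texts,
// then sums the products over the union of all words.
// (Hypothetical name; timing code from the question omitted.)
function score_union(array $terms1, array $terms2): float {
    $all = array_unique(array_merge($terms1, $terms2));
    $freq1 = array_fill_keys($all, 0);
    $freq2 = array_fill_keys($all, 0);
    foreach ($terms1 as $t) { $freq1[$t]++; }
    foreach ($terms2 as $t) { $freq2[$t]++; }
    $score = 0;
    foreach ($all as $t) {
        // For a word missing from either text, one factor is 0.
        $score += $freq1[$t] * $freq2[$t];
    }
    return 500 * $score / (count($terms1) * count($terms2));
}

// Simplified sketch of check2(): counts words of the second text
// only if they also occur in the first, so only the intersection is summed.
function score_intersection(array $terms1, array $terms2): float {
    $freq1 = array_count_values($terms1);
    $freq2 = [];
    foreach ($terms2 as $t) {
        if (isset($freq1[$t])) {
            $freq2[$t] = ($freq2[$t] ?? 0) + 1;
        }
    }
    $score = 0;
    foreach ($freq2 as $t => $n) {
        $score += $freq1[$t] * $n;
    }
    return 500 * $score / (count($terms1) * count($terms2));
}

// Made-up sample texts, already split into words.
$a = ['the', 'cat', 'sat', 'on', 'the', 'mat'];
$b = ['the', 'dog', 'sat', 'on', 'the', 'log'];

var_dump(score_union($a, $b) === score_intersection($a, $b)); // bool(true)
```

Only "the", "sat" and "on" occur in both sample texts, so both functions sum the same three non-zero products. The union version just adds four extra zeros for "cat", "mat", "dog" and "log".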
