Affinity between a text and a list of keywords?

Question

I have a portion of text (500-1500 characters).

I also have a list of keywords (1000 records).

What should I do to find the keywords from that list that are related to my given text?

I was thinking of searching the text for occurrences of every keyword in the list, but I think that is a bit "expensive".

Thanks

Recommended answer

I'll throw my hat in the ring…

function extractWords($text, $minWordLength = null, array $stopwords = array(), $caseIgnore = true)
{
    $pattern = '/\w'. (is_null($minWordLength) ? '+' : '{'.$minWordLength.',}') .'/';
    $matches = array();
    preg_match_all($pattern, $text, $matches);
    $words = $matches[0];

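    if ($caseIgnore) {
        $words = array_map('strtolower', $words);
        // Lowercase the stop words too, so they compare equal to the lowercased words.
        $stopwords = array_map('strtolower', $stopwords);
    }

    // Drop every word that appears in the stop word list.
    $words = array_diff($words, $stopwords);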

    return $words;
}

function countKeywords(array $words, array $keywords, $threshold = null, $caseIgnore = true) 
{   
    if ($caseIgnore) {
        $keywords = array_map('strtolower', $keywords);
    }

    $words = array_intersect($words, $keywords);
    $counts = array_count_values($words);
    arsort($counts, SORT_NUMERIC);

    if (!is_null($threshold)) {
        $counts = array_filter($counts, function ($count) use ($threshold) { return $count >= $threshold; });
    }

    return $counts;
}

Usage:

$text = 'a b c a';  // your text
$keywords = array('a', 'b');  // keywords from your database

$words = extractWords($text);
$count = countKeywords($words, $keywords);
print_r($count);

$total = array_sum($count);
var_dump($total);

// Affinity = share of the text's words that are keywords (0 when nothing matched).
$affinity = ($total == 0 ? 0 : $total / count($words));
var_dump($affinity);

Prints:

Array ( [a] => 2 [b] => 1 )
int(3)
float(0.75)
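
On the cost concern raised in the question: one way to avoid scanning the text once per keyword is to turn the keyword list into a hash set and walk the words a single time. The following is only a sketch under that assumption; `countKeywordsFast` is a hypothetical helper name, not part of the answer above, and like the original it matches single-word keywords only.

```php
<?php
// Hypothetical helper (an illustration, not the answer's own code):
// counts keyword hits in one pass over the words of the text.
function countKeywordsFast(array $words, array $keywords): array
{
    // array_flip() turns the keyword values into array keys,
    // so each isset() check below is a hash lookup.
    $lookup = array_flip(array_map('strtolower', $keywords));

    $counts = array();
    foreach ($words as $word) {
        $word = strtolower($word);
        if (isset($lookup[$word])) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    // Most frequent keywords first.
    arsort($counts, SORT_NUMERIC);

    return $counts;
}

// Split the sample text on non-word characters, dropping empty pieces.
$words = preg_split('/\W+/', 'a b c a', -1, PREG_SPLIT_NO_EMPTY);
$counts = countKeywordsFast($words, array('a', 'b'));

// Same affinity measure as above: matched words divided by all words.
$affinity = count($words) === 0 ? 0 : array_sum($counts) / count($words);

print_r($counts);
var_dump($affinity);
```

For the sample input this gives the same counts (`a` twice, `b` once) and the same affinity of 0.75. With 1000 keywords and a text of at most 1500 characters, either version should be cheap in practice, so the simpler one is probably fine unless profiling says otherwise.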
