Function that returns affinity between texts?

Question

Suppose I have a string:

string1 = "hello hi goodmorning evening [...]"

and I have some short keyword strings:

compare1 = "hello evening"
compare2 = "hello hi"

I need a function that returns the affinity between the text and the keywords. For example:

function(string1,compare1);  // returns: 4
function(string1,compare2);  // returns: 5 (more relevant)

Please note that 5 and 4 are just example values.

You could say: write a function that counts occurrences. But that would not work for this example, because both keyword strings get 2 occurrences, yet compare1 is less relevant: "hello evening" is not found exactly in string1 (the two words "hello" and "evening" are farther apart than "hello" and "hi").
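To make the problem concrete, here is a minimal sketch of the naive occurrence counter described above. The function name `count_keyword_hits` is hypothetical, and string1 is cut off at the "[...]" from the question, so only its visible words are used.

```python
def count_keyword_hits(text: str, keywords: str) -> int:
    """Count how many of the keywords occur as whole words in the text."""
    words = set(text.split())
    return sum(1 for kw in keywords.split() if kw in words)


# Only the visible part of string1 from the question (assumption).
string1 = "hello hi goodmorning evening"

print(count_keyword_hits(string1, "hello evening"))  # 2
print(count_keyword_hits(string1, "hello hi"))       # 2
```

Both calls return 2, so a plain count cannot tell the adjacent pair "hello hi" apart from the distant pair "hello evening".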

Is there any known algorithm to do this?

ADD1:

Algorithms like edit distance would NOT work in this case, because string1 is a complete text (around 300-400 words), while the comparison strings are at most 4-5 words.

Recommended answer

A Dynamic Programming Algorithm

It seems what you are looking for is very similar to what the Smith–Waterman algorithm does.

From Wikipedia:

The algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981. Like the Needleman-Wunsch algorithm, of which it is a variation, Smith-Waterman is a dynamic programming algorithm. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).
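As a rough illustration of the idea, the sketch below computes a Smith-Waterman local alignment score over word tokens. It is not the implementation used in this answer: the function name `smith_waterman_score`, the word-level tokenization, and the scoring values (match = 2, mismatch = -1, linear gap = -1) are all assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Uses a simple match/mismatch score and a linear gap penalty
    (assumed values, not taken from the answer).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for x in a:
        curr = [0]  # column 0 is always 0
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Visible part of string1 from the question, split into words.
string1 = "hello hi goodmorning evening".split()

print(smith_waterman_score(string1, "hello evening".split()))  # 2
print(smith_waterman_score(string1, "hello hi".split()))       # 4
```

With these assumed scores, "hello hi" ranks higher than "hello evening" because the two matching words are adjacent in the text, while the gap between "hello" and "evening" cancels out the second match.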

Let's see a practical example, so you can evaluate its usefulness.

Suppose we have a text:

text = "We the people of the United States, in order to form a more 
perfect union, establish justice, insure domestic tranquility, 
provide for the common defense, 

  promote the general welfare, 

  and secure the blessings of liberty to ourselves and our posterity, 
do ordain and establish this Constitution for the United States of 
America.";  

I isolated the segment we are going to match, just for ease of reading.

We will compare its affinity (or similarity) with a list of strings:

list = {
   "the general welfare",
   "my personal welfare",
   "general utopian welfare",
   "the general",
   "promote welfare",
   "stackoverflow rulez"
   };  

I have the algorithm already implemented, so I'll calculate the similarity and normalize the results:

sw = SmithWatermanSimilarity[ text, #] & /@ list;
swN = (sw - Min[sw])/(Max[sw] - Min[sw])  
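For readers not using Mathematica, the normalization on the second line is a min-max rescaling to the range 0 to 1. A minimal Python sketch follows; the function name `min_max_normalize` is hypothetical, and the input values are made-up placeholders, not the scores computed in this answer.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal: avoid dividing by zero (assumed convention).
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Placeholder values for illustration only.
print(min_max_normalize([0, 2, 4]))  # [0.0, 0.5, 1.0]
```

The guard for identical scores is an addition of this sketch; the Mathematica expression above would divide by zero in that case.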

Then we plot the results:

I think it's very similar to your expected result.

HTH!

Some implementations (w/source code)
