Similarity of two whole texts using Levenshtein distance

Question

I have two text files that I'd like to compare. What I did:


  1. I split them into sentences.

  2. I measured the Levenshtein distance between each sentence of one file and each sentence of the second file.

I'd like to calculate the average similarity between the two text files, but I'm having trouble producing a meaningful value. Obviously the arithmetic mean (the sum of all the [normalized] distances divided by the number of comparisons) is a bad idea.

How should such results be interpreted?

Edit: Distance values are normalized.

Recommended answer

The Levenshtein distance has a maximum value: the length of the longer of the two input strings. It cannot get worse than that. So a normalized similarity index (0 = bad, 1 = match) for two strings a and b can be calculated as 1 - distance(a, b) / max(a.length, b.length).
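As a rough illustration of that formula, here is a minimal Python sketch. The function names `levenshtein` and `similarity` are made up for this example, and treating two empty strings as a perfect match is an assumption the answer does not cover.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity index: 0 = nothing in common, 1 = identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(similarity("kitten", "sitting"))
```

Dividing by the longer length keeps the result in the range 0 to 1, because the distance can never exceed that length.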

Take one sentence from File A. You said you'd compare it to each sentence of File B. I guess you are looking for the sentence in B that has the smallest distance (i.e. the highest similarity index).

Simply calculate the average of those best-match similarity indexes, i.e. one value per sentence of File A, taken from its minimum-distance match in File B. This should give you a rough estimate of the similarity of the two texts.
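The averaging step described above could be sketched as follows. This is only one possible reading of the answer: the names `text_similarity`, `sentences_a` and `sentences_b` are hypothetical, the inputs are assumed to be already split into sentences, and returning 0.0 for an empty input is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity index in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def text_similarity(sentences_a: list, sentences_b: list) -> float:
    """Average, over sentences of A, of the best match found in B."""
    if not sentences_a or not sentences_b:
        return 0.0  # assumption: nothing to compare means no similarity
    best_matches = [
        max(similarity(s, t) for t in sentences_b)  # closest sentence in B
        for s in sentences_a
    ]
    return sum(best_matches) / len(best_matches)


file_a = ["The cat sat on the mat.", "It was raining."]
file_b = ["It was raining.", "The cat sat on a mat."]
print(text_similarity(file_a, file_b))
```

Note that this measure is not symmetric: `text_similarity(a, b)` and `text_similarity(b, a)` can differ, so averaging the two directions may be worth considering.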

But what makes you think that two similar texts might have their sentences shuffled? My personal opinion is that you should also introduce stop-word lists, synonyms and the like.

Nevertheless, also take a look at trigram matching, which might be another good approach to what you are looking for.
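The answer does not say which trigram scheme it has in mind. As one hedged illustration, the sketch below uses character trigrams compared with the Jaccard index (shared trigrams divided by all distinct trigrams). The names `trigrams` and `trigram_similarity`, the lowercasing, and the handling of strings shorter than three characters are all assumptions made for this example.

```python
def trigrams(text: str) -> set:
    """Set of all 3-character substrings of the lowercased text."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard index of the two trigram sets: 0 = disjoint, 1 = same set."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        # assumption: inputs too short for any trigram fall back to equality
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(ta & tb) / len(union)


print(trigram_similarity("The cat sat on the mat.", "The cat sat on a mat."))
```

Because the trigrams are collected into sets, this comparison ignores where in the text a trigram occurs, which makes it less sensitive to reordered passages than a sentence-by-sentence edit distance.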
