Text similarity algorithm


Question

I have two subtitle files. I need a function that tells whether they represent the same text or similar text.

Sometimes there are comments like "The wind is blowing... the music is playing" in one file only, but 80% of the contents will be the same. The function must return TRUE (the files represent the same text). And sometimes there are misspellings like 1 instead of l (the digit one in place of the letter L), as here: "She 1eft the baggage". Of course, this means the function must return TRUE.

My comments:

The function should return the percentage of similarity between the texts - AGREE

"All the people were happy" and "all the people were not happy" - here that would be treated as a misspelling, so the two would be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

Do consider whether you want to apply Levenshtein to a whole file or just to a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It will be a very long string, though.

Recommended answer

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

Any result other than zero means the texts are not "identical". "Similar" is a measure of how far apart or how close they are. The result is an integer.
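As a rough illustration of how that integer distance could be turned into the percentage and TRUE/FALSE result the question asks for, here is a minimal Python sketch. The function names (`levenshtein`, `similarity_percent`, `same_text`), the normalization by the longer string's length, and the 80% default threshold are all assumptions made for this example, not something the answer specifies.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop (less memory)
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100% for identical texts, lower as the
    distance grows relative to the longer text."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty texts are identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


def same_text(a: str, b: str, threshold: float = 80.0) -> bool:
    """Hypothetical decision rule; the threshold is an assumption to tune."""
    return similarity_percent(a, b) >= threshold


print(levenshtein("She left the baggage", "She 1eft the baggage"))
print(same_text("all the people were happy", "all the people were not happy"))
```

This version keeps only two rows of the distance table, so memory grows with the shorter text, but running time is still proportional to the product of the two lengths. That may be slow for whole subtitle files compared character by character.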
