Text similarity algorithm

Views: 369 Published: 2018/12/10 21:21:41 Tags: java text nlp levenshtein-distance similarity

Problem description

I have two subtitle files. I need a function that tells whether they represent the same text or similar text.

Sometimes one file contains comments such as "The wind is blowing... the music is playing" that the other file lacks, but 80% of the contents will be the same. In that case the function must return TRUE (the files represent the same text).

Sometimes there are misspellings such as 1 instead of l (the digit one instead of the letter L), as in: She 1eft the baggage. Of course, the function must return TRUE here as well.

My comments:

The function should return the percentage of similarity between the texts - AGREE
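One possible way to get such a percentage is to normalize a character-level Levenshtein distance by the length of the longer text. The sketch below is only an illustration under that assumption: the class name `SimilarityPercent`, its method names, and the idea of passing a threshold to turn the percentage into TRUE/FALSE are hypothetical and do not come from the question.

```java
final class SimilarityPercent {

    /** Levenshtein edit distance, computed with two rolling rows to limit memory. */
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in the range 0..100, relative to the longer input. */
    static double similarityPercent(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 100.0; // two empty texts are treated as identical
        }
        return 100.0 * (maxLen - levenshtein(a, b)) / maxLen;
    }

    /** Hypothetical TRUE/FALSE decision: similar enough if at or above the threshold. */
    static boolean sameText(String a, String b, double thresholdPercent) {
        return similarityPercent(a, b) >= thresholdPercent;
    }
}
```

With this definition, "She left the baggage." and "She 1eft the baggage." differ by one substitution over 21 characters, so the score is a little over 95%. Which threshold counts as "the same text" is a choice the caller has to make.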

"all the people were happy" and "all the people were not happy" - here that would be treated as a misspelling, so the two would be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

Do consider whether you want to apply Levenshtein to a whole file or just to a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.
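Because the cost of edit distance grows with the product of the two input lengths, one option for whole files is to compare sequences of words instead of characters, which shortens the sequences considerably. The following is a rough sketch of that idea, not a method taken from the question: the class `WordLevelSimilarity`, its tokenization rule (lowercase runs of letters and digits) and its normalization by the longer token count are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordLevelSimilarity {

    // A "word" here is an assumption: any run of letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Lowercases the text and extracts its words, dropping punctuation and whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Levenshtein distance where the unit of editing is a whole word. */
    static int wordDistance(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert or delete a word
                        prev[j - 1] + cost);                    // replace or keep a word
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    /** Similarity in the range 0..100, relative to the longer word sequence. */
    static double similarityPercent(String textA, String textB) {
        List<String> a = tokenize(textA);
        List<String> b = tokenize(textB);
        int maxLen = Math.max(a.size(), b.size());
        if (maxLen == 0) {
            return 100.0; // no words on either side
        }
        return 100.0 * (maxLen - wordDistance(a, b)) / maxLen;
    }
}
```

For "all the people were happy" versus "all the people were not happy" this counts one inserted word out of six, roughly 83%. The trade-off is that a one-character typo such as "1eft" costs a whole word at this granularity, so a short line with a typo scores lower than it would character by character. File contents could be loaded with `java.nio.file.Files.readString` (Java 11 and later) before being passed in.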
