Similarity String Comparison in Java
Question
I want to compare several strings to each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to other strings? For example:
- "The quick fox jumped" -> "The fox jumped"
- "The quick fox jumped" -> "The fox"
This comparison would report that the first pair is more similar than the second.
I guess I need a method such as:
double similarityIndex(String s1, String s2)
Does something like this exist somewhere?
Why am I doing this? I am writing a script that compares the output of an MS Project file to the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries in the legacy system so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.
Recommended answer
Yes, there are many well-documented algorithms, such as:
- Cosine similarity
- Jaccard similarity
- Dice's coefficient
- Matching similarity
- Overlap similarity
- etc.
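As a rough illustration of three of the measures listed above, here is a minimal Java sketch that computes Jaccard similarity, Dice's coefficient, and the overlap coefficient over sets of character bigrams. The class name `StringSimilarity`, the choice of lowercase character bigrams as tokens, and the convention that two strings with no bigrams score 1.0 are all assumptions made for this example; word tokens or longer n-grams are equally valid choices. `similarityIndex` is the method name from the question, wired to Dice's coefficient here only as one possible default.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of any library.
final class StringSimilarity {

    // Returns the set of lowercase character bigrams of s (assumed tokenization).
    static Set<String> bigrams(String s) {
        String t = s.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= t.length(); i++) {
            grams.add(t.substring(i, i + 2));
        }
        return grams;
    }

    // Number of bigrams the two sets have in common.
    static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Jaccard similarity: |intersection| / |union|.
    static double jaccard(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention for two token-free strings
        }
        int common = intersectionSize(a, b);
        return (double) common / (a.size() + b.size() - common);
    }

    // Dice's coefficient: 2 * |intersection| / (|A| + |B|).
    static double dice(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention for two token-free strings
        }
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    // Overlap coefficient: |intersection| / min(|A|, |B|).
    static double overlap(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention for two token-free strings
        }
        int smaller = Math.min(a.size(), b.size());
        return smaller == 0 ? 0.0 : (double) intersectionSize(a, b) / smaller;
    }

    // The signature asked for in the question; Dice is an arbitrary default.
    static double similarityIndex(String s1, String s2) {
        return dice(s1, s2);
    }

    public static void main(String[] args) {
        String full = "The quick fox jumped";
        System.out.println(similarityIndex(full, "The fox jumped"));
        System.out.println(similarityIndex(full, "The fox"));
    }
}
```

With this tokenization, the question's example behaves as hoped: Dice's coefficient is 26/32 = 0.8125 for "The fox jumped" and 12/25 = 0.48 for "The fox". The overlap coefficient, by contrast, gives 1.0 to both, because every bigram of each shorter string also occurs in the longer one, so it is a poor fit when the goal is to rank abbreviations of the same description.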
A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it points to the Internet Archive).
Also check these projects: