Similarity String Comparison in Java


Question

I want to compare several strings to each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to which other strings? For example:

  • "The quick fox jumped" -> "The fox jumped"
  • "The quick fox jumped" -> "The fox"

This comparison would report that the first pair is more similar than the second.

I guess I need a method such as:

double similarityIndex(String s1, String s2)

Does such a thing exist somewhere?

Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries in the legacy system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.
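The matching step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `AbbreviationMatcher`, the helper `bestMatch`, and the sample task descriptions are hypothetical, and the score used here (a Dice coefficient over character bigrams) is just one possible choice for `similarityIndex`, picked because abbreviations tend to break whole-word comparisons.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: picks the full description closest to an abbreviated one. */
class AbbreviationMatcher {

    /** Character bigrams of the lowercased input with all whitespace removed. */
    static List<String> bigrams(String s) {
        String cleaned = s.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < cleaned.length(); i++) {
            result.add(cleaned.substring(i, i + 2));
        }
        return result;
    }

    /** Dice coefficient over bigram multisets: 2 * shared / (bigrams in s1 + bigrams in s2). */
    static double similarityIndex(String s1, String s2) {
        List<String> a = bigrams(s1);
        List<String> b = bigrams(s2);
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // assumption: inputs shorter than two characters score zero
        }
        int total = a.size() + b.size();
        int shared = 0;
        for (String gram : a) {
            if (b.remove(gram)) { // each bigram in b can be matched only once
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    /** Returns the candidate with the highest score, or null if the list is empty. */
    static String bestMatch(String abbreviated, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarityIndex(abbreviated, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up task descriptions, standing in for MS Project entries.
        List<String> projectEntries = Arrays.asList(
                "Install database server",
                "Review project budget",
                "Train support staff");
        // Made-up abbreviated description, standing in for a legacy-system entry.
        System.out.println(bestMatch("Install db server", projectEntries));
    }
}
```

Since the result still has to be checked manually, it may be more useful in practice to print the top few candidates together with their scores instead of only the single best one.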

Recommended answer

Yes, there are many well-documented algorithms, such as:

  • Cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity
  • and so on
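To make a few of these measures concrete, here is a minimal sketch of four of them (Jaccard, Dice, overlap, and cosine) in their set-based forms. It is not taken from the answer or from any particular library: the class name `SetSimilarity`, the whitespace tokenization, and the convention that two empty strings score 1.0 are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative set-based similarity measures over lowercase word tokens. */
class SetSimilarity {

    /** Splits on whitespace after lowercasing (an assumed, very simple tokenization). */
    static Set<String> tokens(String s) {
        String trimmed = s.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Assumed convention: two empty inputs are identical, one empty input matches nothing. */
    static double emptyCase(Set<String> a, Set<String> b) {
        return (a.isEmpty() && b.isEmpty()) ? 1.0 : 0.0;
    }

    /** Jaccard: |A ∩ B| / |A ∪ B| */
    static double jaccard(String s1, String s2) {
        Set<String> a = tokens(s1);
        Set<String> b = tokens(s2);
        if (a.isEmpty() || b.isEmpty()) return emptyCase(a, b);
        int common = intersectionSize(a, b);
        return (double) common / (a.size() + b.size() - common);
    }

    /** Dice: 2 |A ∩ B| / (|A| + |B|) */
    static double dice(String s1, String s2) {
        Set<String> a = tokens(s1);
        Set<String> b = tokens(s2);
        if (a.isEmpty() || b.isEmpty()) return emptyCase(a, b);
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    /** Overlap: |A ∩ B| / min(|A|, |B|) */
    static double overlap(String s1, String s2) {
        Set<String> a = tokens(s1);
        Set<String> b = tokens(s2);
        if (a.isEmpty() || b.isEmpty()) return emptyCase(a, b);
        return (double) intersectionSize(a, b) / Math.min(a.size(), b.size());
    }

    /** Cosine for binary (present/absent) token vectors: |A ∩ B| / sqrt(|A| * |B|) */
    static double cosine(String s1, String s2) {
        Set<String> a = tokens(s1);
        Set<String> b = tokens(s2);
        if (a.isEmpty() || b.isEmpty()) return emptyCase(a, b);
        return intersectionSize(a, b) / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        String full = "The quick fox jumped";
        for (String other : new String[] {"The fox jumped", "The fox"}) {
            System.out.printf("%-16s jaccard=%.3f dice=%.3f overlap=%.3f cosine=%.3f%n",
                    other, jaccard(full, other), dice(full, other),
                    overlap(full, other), cosine(full, other));
        }
    }
}
```

On the example from the question, Jaccard gives 0.75 for "The fox jumped" and 0.5 for "The fox", which matches the expected ranking. The overlap measure gives 1.0 for both, because each shorter string is a subset of the longer one, so the choice of measure matters for this kind of comparison.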

A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it points to the Internet Archive).

Also check these projects:
