Algorithm - String similarity score/hash

Question

Is there a method to calculate something like a general "similarity score" for a string? I am not comparing two strings directly; rather, I want a number/score (hash) for each string that can later tell me whether two strings are similar. Two similar strings should have similar (close) scores/hashes.

Let's consider these strings and scores as an example:

Hello world 1000

Hello world! 1010

Hello earth 1125

Foo bar 3250

FooBarbar 3750

Foo bar! 3300

Foo world! 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
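The lookup described above can be sketched in a few lines of Python. The scores are copied from the example in the question; the function that would produce them is not specified, and `most_similar` is a hypothetical helper name used only for illustration.

```python
# Scores copied from the example in the question. How they would be
# computed is exactly what the question asks, so they are hard-coded here.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k other strings whose score is closest to the query's score."""
    target = scores[query]
    others = [s for s in scores if s != query]
    # Sort by absolute score difference, smallest first.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]


print(most_similar("Hello world", scores, k=2))
```

Note that this only works as well as the scoring function does: a single number puts all strings on one line, so two unrelated strings can end up with close scores by accident.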

My end aim is: there will be streaming log messages (pure messages only), and I want to find the pattern of those messages (some sort of regular expression). But that can only start once I can bucket similar strings. So, again, I need a number/score (hash) for each string that can later tell me whether two strings are similar.

Recommended answer

Take a look at locality-sensitive hashing.

The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).
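One common way to get this behavior for strings is MinHash with banding. The answer does not say which scheme it has in mind, so the following is only a minimal sketch under assumptions: character 3-grams as the unit of comparison, 64 seeded hash functions, and a 16-band split. All function names and constants are illustrative choices, not taken from the answer.

```python
import hashlib

NUM_HASHES = 64  # signature length (assumed value)
BANDS = 16       # 16 bands of 4 rows each (assumed split; tune for your data)


def shingles(text, k=3):
    """Set of overlapping k-character substrings of the lower-cased text."""
    text = text.lower()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of `token` under hash function number `seed`."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_HASHES))


def bucket_keys(signature):
    """Split the signature into bands; each (band index, band values) is a bucket key."""
    rows = NUM_HASHES // BANDS
    return {(b, signature[b * rows:(b + 1) * rows]) for b in range(BANDS)}


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions: an estimate of the Jaccard similarity
    of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature("Hello world")
sig_b = minhash_signature("Hello world!")
sig_c = minhash_signature("Foo bar")
print(estimated_similarity(sig_a, sig_b), estimated_similarity(sig_a, sig_c))
```

Two strings become candidates for the same group when their `bucket_keys` sets share at least one key, so each incoming log message only needs to be compared against messages already stored under its own keys. Unlike the single score in the question, the signature is a tuple, and closeness is measured by how many positions agree rather than by subtraction.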

There's a very good explanation available here together with some sample code.
