Python digest/hash for string similarity

Views: 116. Published: 2015/11/30 15:06:43. Tags: python, algorithm, similarity


Problem description

I'm looking for an algorithm that can generate a short hash code/digest (e.g. 16 characters; the exact length is not important) from a longer string.

The main requirement is that strings that are almost identical should result in the same digest.

For example, two almost identical mails:

Hi Martin. Here are some ... spam for you. Regards XYZ. => AAAA AAAA AAAA AAAA

Hi Bo. Here are some ... spam for you. Regards EFG. => AAAA AAAA AAAA AAAA

return the same digest (or almost the same one), whereas a different mail:

Hello Finn. This is a test mail. => CCCC CCCC CCCC CCCC

would return a different digest.

This algorithm would be part of a spam filter. The filter will remember digests from mails that it is certain are spam. If the same digest shows up in a mail the filter is unsure about, the identical digest will cause the filter to increase the spam score.
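The filter behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions: the names `DigestMemory`, `remember_spam`, `adjust_score` and the `bonus` value are hypothetical, and `exact_digest` is a stand-in built on an ordinary SHA-1 hash, so it does not satisfy the "almost identical strings give the same digest" requirement. It merely marks the place where such a similarity digest would be plugged in.

```python
import hashlib
from typing import Callable, Set


def exact_digest(text: str) -> str:
    """Stand-in digest: 16 hex chars of SHA-1 over normalised text.

    NOTE: this is an ordinary hash, so near-identical mails still get
    different digests. It is only a placeholder for the similarity
    digest the question asks about.
    """
    normalised = " ".join(text.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()[:16]


class DigestMemory:
    """Remembers digests of confirmed spam and adjusts scores of other mails."""

    def __init__(self, digest_fn: Callable[[str], str] = exact_digest,
                 bonus: float = 1.0) -> None:
        self.digest_fn = digest_fn   # any function mapping a mail to a short digest
        self.bonus = bonus           # hypothetical amount added to the spam score
        self.known: Set[str] = set()

    def remember_spam(self, mail: str) -> None:
        """Store the digest of a mail the filter is certain is spam."""
        self.known.add(self.digest_fn(mail))

    def adjust_score(self, mail: str, score: float) -> float:
        """Raise the score if this mail's digest matches a known spam digest."""
        if self.digest_fn(mail) in self.known:
            return score + self.bonus
        return score
```

Because only digests are stored, each lookup is a single set membership test, independent of how many spam mails have been seen. Everything therefore hinges on finding a suitable `digest_fn`.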

I know about Levenshtein distance, but it requires me to know the strings up front. In this situation I do not have this information. I could have it, but that would require the filter to store all spam e-mails and check against each one, which would be a very slow process.
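To make the cost concern concrete, here is a minimal sketch of that brute-force approach. It assumes the filter keeps every spam mail in a collection called `stored_spam`. The function names and the `max_distance` threshold are hypothetical choices for illustration, not part of any existing filter.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def resembles_stored_spam(mail: str, stored_spam: Iterable[str],
                          max_distance: int) -> bool:
    """Compare one mail against every stored spam mail (the slow part)."""
    return any(levenshtein(mail, spam) <= max_distance for spam in stored_spam)
```

Each incoming mail triggers one quadratic-time comparison per stored spam mail, which is why this approach scales poorly compared with looking up a precomputed digest.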

Maybe some loose compression algorithm coupled with a calculation of the Levenshtein distance between the two could work.

Any pointers appreciated.
