A Regex to match a SHA1

Question

I'm trying to match SHA1s in generic text with a regular expression.

Ideally I want to avoid matching words.

It's safe to say that full SHA1s have a distinctive pattern (they're long and a consistent length), so I can match these reliably, but what about abbreviated SHA1s?

Can I rely on the presence of numbers?

Looking at the SHA1s in my commit log, numbers always appear in the first 3 characters. But is this too short? How many characters of a SHA1 do I need to consider before I can assume a number will have appeared?

This does not have to be 100% accurate - I just need to match an abbreviated SHA1 99% of the time.

Solution

You can consider SHA1 hashes to be completely random, so this reduces to a matter of probabilities. The probability that a given hex digit is a letter (a-f) rather than a number is 6/16, or 0.375. The probability that three SHA1 digits are all letters is 0.375 ** 3, or 0.0527 (about 5%). At four digits it is still about 2%. At five digits it drops to 0.0074, below the 1% you said you could tolerate, and at six digits it falls to 0.00278 (about 0.3%).
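The arithmetic can be checked with a few lines of Python. This is only a sketch under the same assumption as the answer (each hex character is independent and uniformly distributed); the function names are made up for illustration.

```python
# Assumption: every hex character of a hash is independent and uniform,
# so each one is a letter (a-f) with probability 6/16.

def all_letters_probability(n: int) -> float:
    """Probability that the first n hex characters contain no decimal digit."""
    return (6 / 16) ** n


def shortest_length_below(threshold: float) -> int:
    """Smallest prefix length whose all-letters probability is under threshold."""
    n = 1
    while all_letters_probability(n) >= threshold:
        n += 1
    return n


if __name__ == "__main__":
    for n in range(3, 8):
        print(n, f"{all_letters_probability(n):.5f}")
    # Shortest prefix for which fewer than 1% of hashes are all letters.
    print(shortest_length_below(0.01))
```

Under that assumption, five characters is the shortest abbreviation for which fewer than 1% of hashes contain no digit at all.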

It's easy to craft a regular expression that always matches SHA1 values abbreviated to five or more characters:

\b[0-9a-f]{5,40}\b

However, this may also match perfectly good five-letter words, like "added" or "faded". In my /usr/share/dict/words file, there are several six-letter words that would match: "accede", "beaded", "bedded", "decade", "deface", "efface", and "facade" are the most likely. At seven letters, there is only "deedeed", which is unlikely to appear in prose. It all depends on how many false positives you can tolerate, and what the likely words you will encounter actually are.
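If word matches are the bigger problem, one option that follows from the question's idea of relying on digits is to require at least one digit with a lookahead. The sketch below is not part of the original answer: the pattern names and the sample sentence are invented for illustration, and it assumes lowercase hashes as git prints them. The trade-off is that an all-letter abbreviation such as "deadbeef" is missed, which by the probabilities above should happen for fewer than 1% of abbreviations of five or more characters.

```python
import re

# Pattern from the answer: any run of 5 to 40 lowercase hex characters.
ANY_HEX = re.compile(r"\b[0-9a-f]{5,40}\b")

# Hypothetical variant: the lookahead insists that a decimal digit appears
# somewhere in the run, which rules out dictionary words such as "facade".
HEX_WITH_DIGIT = re.compile(r"\b(?=[a-f]*[0-9])[0-9a-f]{5,40}\b")

# Invented sample text for illustration.
text = (
    "Fixed in commit 3f2a9c1, the facade was added by decade-old code; "
    "see also deadbeef and 1a2b3."
)

print(ANY_HEX.findall(text))
print(HEX_WITH_DIGIT.findall(text))
```

On this sample the first pattern also picks up "facade", "added", "decade" and "deadbeef", while the second keeps only the two tokens that contain a digit.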
