如何使用整数唯一标识一组字符串 [英] How to uniquely identify a set of strings using an integer

查看:122
本文介绍了如何使用整数唯一标识一组字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是我的问题陈述:

  • 我有一组与正则表达式匹配的字符串.假设它匹配[A-Z] [0-9] {3}(即1个字母和3个数字).
  • 我可以使用1到30之间的任意数量的字符串.例如,我可以拥有:
    • {A123}
    • {A123,B456}
    • {Z789,D752,E147,...,Q665}
    • ...
    • I have a set of strings that match a regular expression. let's say it matches [A-Z][0-9]{3} (i.e. 1 letter and 3 digits).
    • I can have any number of strings between 1 and 30. For example I could have:
      • {A123}
      • {A123, B456}
      • {Z789, D752, E147, ..., Q665}
      • ...

      我可以使用哪种算法?

      What sort of algorithm could I use?

      我的第一个想法是将字符串转换为数字,然后对它们进行操作(我认为是哈希函数),但是我不确定会给我什么公式会产生结果.

      My first idea would be to convert my strings to number and then do operations (I thought of hash functions) on them but I am not sure what formula would be give me could results.

      有什么建议吗?

      推荐答案

      您有2 ^ 333种可能的输入集(((26 * 10 ^ 3)选择30).

      You have 2^333 possible input sets ((26 * 10^3) choose 30).

      这意味着您将需要一个333位宽的整数来表示所有可能性.您最多只能有256位,因此会发生冲突.

      This means you would need a 333 bit wide integer to represent all possibilities. You only have a maximum of 256 bits, so there will be collisions.

      这是哈希函数的典型应用程序.有多种用途的哈希值,因此选择正确的类型很重要:

      This is a typical application for a hash function. There are hashes for various purposes, so it's important to select the right type:

      • 在基于存储桶的数据结构(字典)中使用的简单哈希函数必须快速.碰撞不仅是可以容忍的,而且是通缉的.散列的大小(以位为单位)通常很小.由于冲突,这种散列不适合您的目的.

      • A simple hash function for use in bucket based data structures (dictionaries) must be fast. Collisions are not only tolerated but wanted. The hash's size (in bits) usually is small. Due to collisions this type of hash is not suited for your purpose.

      校验和试图避免冲突,并且速度相当快.如果足够大,则可能足以满足您的情况.

      A checksum tries to avoid collisions and is reasonably fast. If it's large enough this might be enough for your case.

      密码散列的特征是不可能(或很难)找到冲突(即使知道输入和哈希值也是如此).而且它们是不可逆的(从散列中无法找到输入).对于您的用例,这些通常在计算上是昂贵的并且过分杀伤力的.

      Cryptographic hashes have the characteristic that it's not possible (or very hard) to find a collision (even when both input and hash are known). Also they are not invertible (from the hash it's not possible to find the input). These are usually computationally expensive and overkill for your use case.

      用于唯一标识任意输入的哈希,例如 CityHash SpookyHash 旨在用于快速哈希和无冲突识别.

      Hashes to uniquely identify arbitrary inputs, like CityHash and SpookyHash are designed for fast hashing and collision free identification.

      SpookyHash似乎是您的用例的不错选择.它的宽度为128位,这意味着您需要2 ^ 64个不同的输入才能获得单次碰撞50%的机会.

      SpookyHash seems like a good candidate for your use case. It's 128 bits wide, which means that you need 2^64 differing inputs to get a 50% chance of a single collision.

      它也很快:每个周期三个字节比md5或sha1快几个数量级. SpookyHash可在公共领域使用(请参见上面的链接).

      It's also fast: three bytes per cycle is orders of magnitude faster than md5 or sha1. SpookyHash is available in the public domain (see link above).

      要在用例上应用任何哈希,可以将列表中的项目转换为数字,但是将它们作为字符串输入似乎更容易.在这种情况下,您必须满足编码要求(ASCII可以).

      To apply any hash on your use case you could convert the items in your list to numbers, but it seems easier to just feed them as strings. You have to settle for an encoding in this case (ASCII would do).

      当I18N成为问题时,我通常使用UTF8左右.然后,有时需要特别注意规范化.但这不适用于您的简单用例.

      I'm usually using UTF8 or so, when I18N is an issue. Then it's sometimes important to care for canonicalization. But this does not apply to your simple use case.

      这篇关于如何使用整数唯一标识一组字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆