Encoding name strings into a unique number


Problem description

I have a large set of names (millions of them). Each has a first name, an optional middle name, and a last name. I need to encode these names into numbers that uniquely represent them. The encoding should be one-to-one: each name should map to exactly one number, and each number to exactly one name.

What is a smart way of encoding this? I know it is easy to tag each letter of the name with its position in the alphabet (a -> 1, b -> 2, and so on), so a name like Deepa would become 455161. But then I cannot tell whether the '16' is really 16 or a 1 followed by a 6.
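The ambiguity described above can be reproduced in a few lines. The sketch below assumes names contain only the letters a-z, and the function names are made up for illustration. It shows that two different letter sequences produce the same unpadded code, and that padding every letter to two digits makes the code decodable again (although its length still grows with the name, so this alone does not meet the fixed-length requirement).

```python
def naive_encode(name: str) -> str:
    """Concatenate each letter's 1-based alphabet position, unpadded."""
    return "".join(str(ord(c) - ord("a") + 1) for c in name.lower())


def padded_encode(name: str) -> str:
    """Same idea, but every letter always takes exactly two digits."""
    return "".join(f"{ord(c) - ord('a') + 1:02d}" for c in name.lower())


def padded_decode(code: str) -> str:
    """Invert padded_encode by reading the code two digits at a time."""
    return "".join(
        chr(int(code[i:i + 2]) + ord("a") - 1) for i in range(0, len(code), 2)
    )


# "deepa" and "deeafa" collide in the naive scheme: p=16 vs. a=1, f=6.
print(naive_encode("deepa"), naive_encode("deeafa"))
# With padding the two codes differ and can be decoded unambiguously.
print(padded_encode("deepa"), padded_encode("deeafa"))
print(padded_decode(padded_encode("deepa")))
```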

So, I am looking for a smart way of encoding the names.

Furthermore, the output number for any name should have a fixed number of digits; that is, its length should be independent of the length of the name. Is this possible?

Thanks,
Abhishek S

Recommended answer

What you are trying to do is actually hashing (at least if you want a fixed number of digits). There are good hashing algorithms with very few collisions. Try SHA-1, for example: it is well tested and available in most modern languages (see http://en.wikipedia.org/wiki/Sha1). It seems to be good enough for git, so it might work for you.
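As a minimal sketch of the hashing idea, the snippet below uses Python's standard hashlib module to turn a name into a fixed-width decimal string. The function name, the way the three name parts are joined, and the choice of separator are assumptions made for this example, not part of the answer.

```python
import hashlib

# A SHA-1 digest is 160 bits; the largest value, 2**160 - 1, has 49 decimal digits.
SHA1_DECIMAL_DIGITS = len(str(2**160 - 1))


def name_to_number(first: str, last: str, middle: str = "") -> str:
    """Hash a name to a zero-padded decimal string of fixed length.

    The parts are joined with an ASCII unit separator (an assumption of
    this sketch) so that ("ab", "c") and ("a", "bc") do not hash the same.
    """
    canonical = "\x1f".join([first, middle, last])
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    return str(int.from_bytes(digest, "big")).zfill(SHA1_DECIMAL_DIGITS)


number = name_to_number("Deepa", "S")
print(len(number), number)
```

Every name maps to a string of the same length regardless of how long the name is. The mapping cannot be reversed from the number alone, so recovering the name still requires storing the pair somewhere.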

There is, of course, a small possibility that two different names get identical hash values, but that is always the case with hashing and can be handled. With SHA-1 and similar algorithms there is no obvious connection between names and IDs, which can be a good or a bad thing depending on your problem.
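To get a feel for how small that possibility is, the standard birthday approximation can be evaluated directly. This is a rough estimate that assumes the hash behaves like a uniformly random function; the function name and the sample sizes are chosen only for illustration.

```python
import math


def collision_probability(num_names: int, num_possible_values: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N))."""
    exponent = num_names * (num_names - 1) / (2 * num_possible_values)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


# Ten million names hashed into the full 160-bit SHA-1 range.
print(collision_probability(10_000_000, 2**160))
# One million names squeezed into only 10 decimal digits.
print(collision_probability(1_000_000, 10**10))
```

Under this approximation the full 160-bit range makes a collision among millions of names practically impossible, whereas cutting the output down to a short fixed number of digits makes a collision close to certain.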

If you really need guaranteed unique IDs, you will have to do something like what NealB suggested: create the IDs yourself and link names to IDs in a database. You could generate them randomly and check for collisions, or increment them, starting at 0000000000001 or so.
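The incrementing variant might look roughly like the sketch below. Two in-memory dictionaries stand in for the database, and the class name, method names, and the default width of 13 digits are assumptions made for this example.

```python
class NameRegistry:
    """Assigns sequential, zero-padded IDs and keeps both lookup directions."""

    def __init__(self, width: int = 13):
        self.width = width          # fixed number of digits per ID (assumed)
        self._id_by_name = {}       # (first, middle, last) -> ID string
        self._name_by_id = {}       # ID string -> (first, middle, last)
        self._next_id = 1

    def get_id(self, first: str, last: str, middle: str = "") -> str:
        """Return the existing ID for a name, or assign the next free one."""
        key = (first, middle, last)
        if key not in self._id_by_name:
            if self._next_id >= 10 ** self.width:
                raise OverflowError("ran out of IDs for the chosen width")
            new_id = f"{self._next_id:0{self.width}d}"
            self._id_by_name[key] = new_id
            self._name_by_id[new_id] = key
            self._next_id += 1
        return self._id_by_name[key]

    def get_name(self, name_id: str) -> tuple:
        """Return the (first, middle, last) tuple registered for an ID."""
        return self._name_by_id[name_id]


registry = NameRegistry()
first_id = registry.get_id("Deepa", "S")
print(first_id, registry.get_name(first_id))
```

Because each ID is handed out exactly once, the mapping is one-to-one by construction, and it can be reversed with a lookup. The cost is that the table has to be stored and shared by everything that needs to encode or decode a name.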

(Answer improved after giving it some thought and reading the first comments.)
