What is the most effective way to assign a unique integer ID to a string?


Question

The program I am writing processes a large number of objects, each with its own unique id. The id is itself a long string with a complicated structure (a dozen or so unique fields of the object joined by some separator).

I have to process a lot of these objects quickly and refer to them by id while processing, and I have no power to change their format (I retrieve them externally, over the network). So I want to map their complicated string ids to my own internal integer ids, and then use those for comparison, for transferring them to other processes, etc.

What I'm going to do is use a simple dict whose keys are the objects' string ids and whose values are my internal integer ids.

My question is: is there a better way to do this in Python? Maybe there is a way to calculate some hash manually? Maybe a dict is not the best solution?

As for numbers: there are about 100K such unique objects in the system at a time, so the integer capacity is more than enough.

Solution

For comparison purposes, you can intern the strings and then compare them with is instead of ==, which does a simple pointer comparison and should be as fast as (or faster than) comparing two integers:

>>> 'foo' * 100 is 'foo' * 100
False
>>> intern('foo' * 100) is intern('foo' * 100)
True

intern guarantees that id(intern(A)) == id(intern(B)) if and only if A == B. Be sure to intern every string as soon as it is read in, so that all later comparisons operate on interned strings. Note that intern is called sys.intern in Python 3.x.
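For Python 3, the same idea can be sketched roughly as follows. The strings are built at runtime with join so that the compiler cannot merge them into a single constant; the variable names are only illustrative.

```python
import sys

# Build two equal strings at runtime (illustrative names, not from the question).
parts = ["foo"] * 100
a = "".join(parts)
b = "".join(parts)

# Equal contents, but not necessarily the same object.
print(a == b)

# After interning, equal strings are represented by one shared object,
# so an identity check is enough to compare them.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

The identity check only works if every string involved has been passed through sys.intern; comparing an interned string with a non-interned one using is can give a false negative.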

But when you have to pass these strings to other processes, your dict solution seems best. What I usually do in such situations is:

str_to_id = {}
for s in strings:
    str_to_id.setdefault(s, len(str_to_id))
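If the integer ids also need to be turned back into the original strings, one possible extension is to keep a list alongside the dict. This is a minimal sketch, not part of the original answer; build_id_maps, id_to_str, and the sample ids are hypothetical names chosen for illustration.

```python
def build_id_maps(strings):
    """Assign consecutive integer ids to strings in first-seen order.

    Returns (str_to_id, id_to_str); both names are illustrative.
    """
    str_to_id = {}   # string id -> internal integer id
    id_to_str = []   # internal integer id -> string id (index is the id)
    for s in strings:
        if s not in str_to_id:
            str_to_id[s] = len(id_to_str)
            id_to_str.append(s)
    return str_to_id, id_to_str


# Made-up sample ids with a separator, including one repeat.
sample = ["a|b|c", "x|y|z", "a|b|c"]
str_to_id, id_to_str = build_id_maps(sample)

print(str_to_id["x|y|z"])
print(id_to_str[str_to_id["a|b|c"]])
```

The ids produced this way are only meaningful relative to this particular mapping, so any other process that receives them would need the same mapping (or the list) to interpret them.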

Regarding "so the integer capacity is more than enough": Python integers are arbitrary-precision (bigints), so that should never be a problem.
