In Python, why do separate dictionary string values pass "in" equality checks? (String Interning Experiment)

Question

I am building a Python utility that maps integers to word strings, where many integers might map to the same string. From my understanding, Python interns short strings and most hard-coded strings by default, saving memory by keeping a single "canonical" copy of each string in a table. I thought I could benefit from this by interning the string values, even though string interning is aimed more at speeding up key hashing. I wrote a quick test that checks whether equal long strings are the same object (using "is"), first with strings stored in lists and then with strings stored in a dictionary as values. The behavior is unexpected to me:

import sys

top = 10000

non1 = []
non2 = []
for i in range(top):
    s1 = '{:010d}'.format(i)
    s2 = '{:010d}'.format(i)
    non1.append(s1)
    non2.append(s2)

same = True
for i in range(top):
    same = same and (non1[i] is non2[i])
print("non: ", same) # prints False
del non1[:]
del non2[:]


with1 = []
with2 = []
for i in range(top):
    s1 = sys.intern('{:010d}'.format(i))
    s2 = sys.intern('{:010d}'.format(i))
    with1.append(s1)
    with2.append(s2)

same = True
for i in range(top):
    same = same and (with1[i] is with2[i])
print("with: ", same) # prints True

###############################

non_dict = {}
non_dict[1] = "this is a long string"
non_dict[2] = "this is another long string"
non_dict[3] = "this is a long string"
non_dict[4] = "this is another long string"

with_dict = {}
with_dict[1] = sys.intern("this is a long string")
with_dict[2] = sys.intern("this is another long string")
with_dict[3] = sys.intern("this is a long string")
with_dict[4] = sys.intern("this is another long string")

print("non: ",  non_dict[1] is non_dict[3] and non_dict[2] is non_dict[4]) # prints True ???
print("with: ", with_dict[1] is with_dict[3] and with_dict[2] is with_dict[4]) # prints True

I thought that the non_dict checks would print "False", but I was clearly mistaken. Does anyone know what is happening, and whether string interning would yield any benefit at all in my case? I could have many, many more keys than distinct values if I consolidate data from several input texts, so I am looking for a way to save memory. (Maybe I will have to use a database, but that is outside the scope of this question.) Thank you in advance!

Solution

One of the optimizations performed by the bytecode compiler, similar to but distinct from interning, is that it will use the same object for equal constants in the same code block. The string literals here:

non_dict = {}
non_dict[1] = "this is a long string"
non_dict[2] = "this is another long string"
non_dict[3] = "this is a long string"
non_dict[4] = "this is another long string"

are in the same code block, so equal strings end up represented by the same string object.
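The constant sharing described above can be observed directly. The following is a minimal sketch, assuming CPython 3; the variable names and the "<example>" filename are made up for illustration and are not part of the original question. It compiles two equal literals as one code block, checks how many times the literal appears in the code object's constants, and then contrasts that with strings built at runtime, with and without sys.intern.

```python
import sys

# Two equal string literals compiled together as a single code block.
source = 'a = "this is a long string"\nb = "this is a long string"\n'
code = compile(source, "<example>", "exec")

# Count how often the literal occurs in the code object's constants table.
literal_count = sum(1 for c in code.co_consts if c == "this is a long string")

# Run the compiled block and compare the identity of the two values.
namespace = {}
exec(code, namespace)
shared_literals = namespace["a"] is namespace["b"]

# Strings built at runtime are not compile-time constants, so nothing
# merges them automatically (CPython behaviour, not a language guarantee).
s1 = "{:010d}".format(42)
s2 = "{:010d}".format(42)
distinct_runtime = s1 is not s2

# sys.intern returns the canonical object for equal strings.
i1 = sys.intern("{:010d}".format(42))
i2 = sys.intern("{:010d}".format(42))
shared_interned = i1 is i2

print("literal occurrences in co_consts:", literal_count)
print("literals share one object:", shared_literals)
print("runtime strings are distinct objects:", distinct_runtime)
print("interned runtime strings share one object:", shared_interned)
```

This sharing is an implementation detail of the compiler, so code should not rely on "is" for comparing strings. It also only applies to literals: values read from input files are created at runtime, like the formatted strings in the list test, so they share an object only if something such as sys.intern makes them do so.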
