How does Python determine if two strings are identical

Question

I've tried to understand when Python strings are identical (that is, when they share the same memory location). However, in my tests there seems to be no obvious rule for when two equal string variables share the same memory:

import sys
print(sys.version) # 3.4.3

# Example 1
s1 = "Hello"
s2 = "Hello"
print(id(s1) == id(s2)) # True

# Example 2
s1 = "Hello" * 3
s2 = "Hello" * 3
print(id(s1) == id(s2)) # True

# Example 3
i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(s1) == id(s2)) # False

# Example 4
s1 = "HelloHelloHelloHelloHello"
s2 = "HelloHelloHelloHelloHello"
print(id(s1) == id(s2)) # True

# Example 5
s1 = "Hello" * 5
s2 = "Hello" * 5
print(id(s1) == id(s2)) # False

Strings are immutable, and as far as I know Python tries to re-use existing immutable objects, by having other variables point to them instead of creating new objects in memory with the same value.

With this in mind, it seems obvious that Example 1 returns True.
It's still obvious (to me) that Example 2 returns True.

It's not obvious to me why Example 3 returns False. Am I not doing the same thing as in Example 2?!?

I stumbled upon this SO question:
Why does comparing strings in Python using either '==' or 'is' sometimes produce a different result?

and read through http://guilload.com/python-string-interning/ (though I probably didn't understand it all) and thought: okay, maybe "interned" strings depend on the length, so I used HelloHelloHelloHelloHello in Example 4. The result was True.

What puzzled me was Example 5, which does the same as Example 2, just with a bigger number (it effectively produces the same string as Example 4). This time, however, the result was False?!?

I really have no idea how Python decides whether to reuse the same object in memory or to create a new one.

Are there any official sources that can explain this behavior?

Recommended answer

From the link you posted:

Avoiding large .pyc files

So why does 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa' not evaluate to True? Do you remember the .pyc files you encounter in all your packages? Well, Python bytecode is stored in these files. What would happen if someone wrote something like this ['foo!'] * 10**9? The resulting .pyc file would be huge! In order to avoid this phenomena, sequences generated through peephole optimization are discarded if their length is superior to 20.

If you have the string "HelloHelloHelloHelloHello", Python necessarily has to store it as it is (asking the interpreter to detect repeating patterns in a string to save space might be too much). However, when it comes to string values that can be computed at compile time, such as "Hello" * 5, Python evaluates them as part of the so-called "peephole optimization", which decides whether it is worth precomputing the string. Since len("Hello" * 5) > 20, the interpreter leaves the expression as it is to avoid storing too many long strings. The string is then built at run time, once per evaluation, which is why Example 5 yields two distinct objects. Example 3 returns False for a related reason: "Hello" * i involves a variable, so it cannot be folded at compile time at all.
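One way to observe this folding is to compile a snippet and look at the constants stored in the resulting code object. The following is a minimal sketch, assuming CPython; the helper name string_constants is made up for illustration, and the exact contents of co_consts are an implementation detail that can differ between versions.

```python
def string_constants(source):
    """Compile source and return the string constants kept in its code object."""
    code = compile(source, "<example>", "exec")
    return [c for c in code.co_consts if isinstance(c, str)]

# Both operands are literals, so the compiler can precompute the product.
literal_case = string_constants('s = "Hello" * 3')

# One operand is a variable, so the product can only be computed at run time.
runtime_case = string_constants('i = 3\ns = "Hello" * i')

print("HelloHelloHello" in literal_case)  # True on CPython builds that fold this expression
print("HelloHelloHello" in runtime_case)  # False: only "Hello" is stored
```

When the folded string is stored once as a constant, every use of that expression in the same code object refers to the same object, which matches the True result in Example 2.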

As indicated in this question, you can check this length limit in the source code in peephole.c, in the function fold_binops_on_constants. Near the end you will see:

// ...
} else if (size > 20) {
    Py_DECREF(newconst);
    return -1;
}

Actually, that optimization code was moved to the AST optimizer for Python 3.7, so there you would have to look into ast_opt.c, function fold_binop, which now calls the function safe_multiply, which checks that the string is no longer than MAX_STR_SIZE, newly defined as 4096. So it seems that the limit has been raised significantly from Python 3.7 onward.
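To see which limit applies to the interpreter you are running, you can probe a few repeat counts around the thresholds mentioned above. This is only a sketch under the assumption that you are on CPython; is_folded is a hypothetical helper, and the results depend on the interpreter version, so no expected output is shown.

```python
import sys

def is_folded(repeat, unit="a"):
    """Report whether unit * repeat was precomputed into the code object's constants."""
    source = "s = {!r} * {}".format(unit, repeat)
    code = compile(source, "<probe>", "exec")
    return (unit * repeat) in code.co_consts

# Probe just below and above the limits 20 and 4096 discussed in the answer.
for n in (20, 21, 4096, 4097):
    print(sys.version_info[:2], n, is_folded(n))
```

Because this behavior is an optimization detail rather than a language guarantee, code that needs to compare string contents should use == and not rely on id() or is.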
