Python - Why aren't all immutable objects always cached?
Problem description
I am not sure what is happening under the hood with regard to the Python object model for the code below.
You can download the data for the ctabus.csv file from this link.
import csv

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                'route': route,
                'date': date,
                'daytype': daytype,
                'rides': rides})
    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows))           # 736461

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))
print(len(route_ids))      # 690072

# unique routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])
print(len(unique_routes))  # 185
When I call print(len(route_ids)), it prints 690072. Why did Python end up creating this many objects?
I expected this count to be either 185 or 736461: 185 because that is the number of unique routes when I collect them in a set, and 736461 because that is the total number of records in the csv file.
What is this weird number 690072?
I am trying to understand why this partial caching happens. Why can't Python perform full caching, something like below?
import csv

route_cache = {}

# some hack to cache
def cached_route(routename):
    if routename not in route_cache:
        route_cache[routename] = routename
    return route_cache[routename]

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            row[0] = cached_route(row[0])  # cache trick
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                'route': route,
                'date': date,
                'daytype': daytype,
                'rides': rides})
    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows))           # 736461

# unique routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])
print(len(unique_routes))  # 185

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))
print(len(route_ids))      # 185
Solution

A typical record from the file looks like the following:

rows[0]
{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}

That means most of your immutable objects are strings, and only the 'rides' value is an integer.
For small integers (-5...255), Python 3 keeps an integer pool, so these small integers feel as if they were cached (as long as PyLong_FromLong and Co. are used).
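The pool is easy to observe directly. A minimal sketch, assuming CPython (the exact pool bounds are an implementation detail, not a language guarantee):

```python
# CPython keeps a shared pool for integers in -5..255.  Building the
# values at runtime via int(str) sidesteps compile-time constant
# sharing, so any identity seen below comes from the pool itself.
a = int("100")
b = int("100")
print(a is b)      # True: 100 is served from the small-integer pool

c = int("7354")
d = int("7354")
print(c is d)      # False: 7354 is too large, each call makes a new object
```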
The rules are more complicated for strings: as pointed out by @timgeb, they are interned. There is a great article about interning; even though it is about Python 2.7, not much has changed since then. In a nutshell, the most important rules are:
- All strings of length 0 and 1 are interned.
- Strings with more than one character are interned if they consist of characters that can be used in identifiers and are created at compile time, either directly or through peephole optimization/constant folding (but in the second case only if the result is no longer than 20 characters; 4096 since Python 3.7).
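These rules can be checked with a small sketch; everything below is CPython implementation behavior, not something the language guarantees:

```python
# Strings of length 1 are interned even when built at runtime:
a = chr(51)              # '3', created at runtime
b = chr(51)
print(a is b)            # True

# Identifier-like strings created at compile time are interned; the
# concatenations below are constant-folded into one interned constant:
e = "rou" + "te"
f = "rou" + "te"
print(e is f)            # True

# A multi-character runtime string containing '/' is not interned:
c = "01/01/2001"[:5]     # slicing forces a fresh object
d = "01/01/2001"[:5]
print(c is d)            # False
```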
All of the above are implementation details, but taking them into account, we get the following for the row[0] above:

- 'route', 'date', 'daytype' and 'rides' are all interned because they are created at compile time of the function read_as_dicts and don't contain "strange" characters.
- '3' and 'U' are interned because their length is only 1.
- '01/01/2001' isn't interned: it is longer than 1, it is created at runtime, and it wouldn't qualify anyway because it contains the character /.
- 7354 isn't from the small-integer pool, because it is too large. But other entries might be from this pool.

This was an explanation for the current behavior, with only some objects being "cached".
But why doesn't Python cache all created strings/integers?
Let's start with integers. To be able to look up quickly whether an integer has already been created (much faster than O(n)), one has to keep an additional look-up data structure, which needs additional memory. However, there are so many possible integers that the probability of hitting an already-existing integer again is not very high, so in most cases the memory overhead of that look-up data structure would not pay off.

Because strings need more memory, the relative (memory) cost of the look-up data structure isn't that high. But it doesn't make any sense to intern a 1000-character string, because the probability of a randomly created string having exactly the same characters is almost 0!
On the other hand, if for example a hash dictionary is used as the look-up structure, the calculation of the hash will take O(n) (with n the number of characters), which probably won't pay off for large strings.
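The O(n) hashing cost shows up in a rough timing sketch. Note that CPython additionally caches a string's hash inside the object after the first computation, which is exactly why a look-up only pays off when the same string is hashed repeatedly:

```python
import timeit

s = "x" * 10_000_000          # a 10-million-character string built at runtime

first = timeit.timeit(lambda: hash(s), number=1)   # walks all n characters
second = timeit.timeit(lambda: hash(s), number=1)  # served from the hash
                                                   # cached inside the str
print(first > second)   # the first call pays the O(n) cost
```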
Thus, Python makes a trade-off that works pretty well in most scenarios, but it cannot be perfect in some special cases. For those special scenarios you can optimize by hand using sys.intern().
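For example, the hand-rolled route_cache from the question can be replaced by sys.intern; the route tokens below are made up for illustration:

```python
import sys

# .split() builds fresh string objects, so equal duplicates are distinct:
raw = "route147,route9,route147,route9".split(",")
print(len({id(s) for s in raw}))        # 4: four separate objects

# sys.intern maps equal strings to one canonical object:
interned = [sys.intern(s) for s in raw]
print(len({id(s) for s in interned}))   # 2: one object per distinct route
```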
Note: having the same id doesn't mean being the same object if the lifetimes of the two objects don't overlap, so your reasoning in the question isn't entirely watertight - but that is of no consequence in this special case.
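To illustrate the note (the recycling behavior is CPython-specific and never guaranteed):

```python
# While both objects are alive, their ids must differ:
a, b = object(), object()
print(id(a) == id(b))        # False

# With non-overlapping lifetimes, the first temporary is freed before
# the second is created, so CPython will often reuse the same memory
# slot - making two different objects report the same id:
x = id(object())             # the temporary dies right after id() returns
y = id(object())
print(x, y)                  # frequently equal in CPython, but not guaranteed
```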