Python - Why aren't all immutable objects always cached?


Problem description



I am not sure what is happening under the hood with regards to the Python object model for the code below.

You can download the data for the ctabus.csv file from this link

import csv

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)

        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                    'route': route,
                    'date': date,
                    'daytype': daytype,
                    'rides': rides})

    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))

print(len(route_ids)) #690072

# unique_routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])

print(len(unique_routes)) #185

When I call print(len(route_ids)) it prints "690072". Why did Python end up creating so many objects?

I expected this count to be either 185 or 736461: 185 because that is the number of unique routes when I collect them in a set, and 736461 because that is the total number of records in the csv file.

What is this weird number "690072"?

I am trying to understand why this partial caching happens. Why can't Python perform full caching, something like below?

import csv

route_cache = {}

#some hack to cache
def cached_route(routename):
    if routename not in route_cache:
        route_cache[routename] = routename
    return route_cache[routename]

def read_as_dicts(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)

        for row in rows:
            row[0] = cached_route(row[0]) #cache trick
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            records.append({
                    'route': route,
                    'date': date,
                    'daytype': daytype,
                    'rides': rides})

    return records

# read data from csv
rows = read_as_dicts('ctabus.csv')
print(len(rows)) #736461

# unique_routes
unique_routes = set()
for row in rows:
    unique_routes.add(row['route'])

print(len(unique_routes)) #185

# record route ids (object ids)
route_ids = set()
for row in rows:
    route_ids.add(id(row['route']))

print(len(route_ids)) #185

Solution

A typical record from the file looks like the following:

rows[0]
{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354}

That means most of your immutable objects are strings and only the 'rides' value is an integer.

For small integers (-5...255), Python 3 keeps an integer pool, so these small integers feel as if they were cached (as long as PyLong_FromLong and Co. are used).
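A small sketch of this behavior (it is a CPython implementation detail, and real code should never compare numbers with is):

a = int("100")    # created at runtime, value inside -5...255
b = int("100")
print(a is b)     # True: both come from the shared small-integer pool

c = int("7354")   # created at runtime, value outside the pool
d = int("7354")
print(c is d)     # False: two distinct int objects with the same value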

The rules are more complicated for strings: they are, as pointed out by @timgeb, interned. There is a great article about interning; it is about Python 2.7, but not much has changed since then. In a nutshell, the most important rules are:

  1. All strings of length 0 and 1 are interned.
  2. Strings with more than one character are interned if they consist of characters that can be used in identifiers and are created at compile time, either directly or through peephole optimization/constant folding (but in the second case only if the result is no longer than 20 characters; 4096 since Python 3.7). A short sketch of both rules follows this list.
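A short sketch of both rules (again CPython implementation details, so the exact results of the is checks are not guaranteed by the language):

a = "route"                  # identifier-like literal, interned at compile time
b = "route"
print(a is b)                # True: both names refer to the same interned string

c = "".join(["ro", "ute"])   # same value, but built at runtime
print(c == a, c is a)        # True False: equal value, distinct object

d = "01/01/2001"             # contains '/', so it never qualifies for interning
e = "01/01/" + str(2001)     # equal string built at runtime
print(d == e, d is e)        # True False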

All of the above are implementation details, but taking them into account we get the following for the row[0] above:

  1. 'route', 'date', 'daytype', 'rides' are all interned because they are created at compile time of the function read_as_dicts and don't contain "strange" characters.
  2. '3' and 'U' are interned because their length is only 1.
  3. '01/01/2001' isn't interned: it is longer than one character, created at runtime, and wouldn't qualify anyway because it contains the character '/'.
  4. 7354 isn't from the small integer pool because it is too large, but other entries might come from this pool.

This was an explanation for the current behavior, with only some objects being "cached".
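These rules also explain the "weird" 690072 from the question: single-character route names collapse to one shared object each, while longer route names get a fresh string object for every row. A quick check, reusing the rows list from the question (and again relying on CPython details):

by_route = {}
for row in rows:
    by_route.setdefault(row['route'], set()).add(id(row['route']))

one_char = [r for r in by_route if len(r) == 1]
longer = [r for r in by_route if len(r) > 1]
print(len(by_route))                                 # 185 unique route names
print(all(len(by_route[r]) == 1 for r in one_char))  # True: one object per 1-char route
print(any(len(by_route[r]) > 1 for r in longer))     # True: many objects per longer route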

But why doesn't Python cache all created strings and integers?

Let's start with integers. To look up quickly whether an integer with a given value already exists (much faster than O(n)), an additional lookup data structure has to be kept, which costs additional memory. However, there are so many possible integers that the probability of hitting an already existing integer again is not very high, so in most cases the memory overhead of the lookup data structure would not pay off.
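To make the trade-off concrete, a hypothetical "cache every integer" scheme would look much like the cached_route() trick from the question; int_cache and cached_int below are made-up names for illustration only:

import sys

int_cache = {}

def cached_int(value):
    # hypothetical analogue of cached_route() from the question
    if value not in int_cache:
        int_cache[value] = value
    return int_cache[value]

x = cached_int(10 ** 9)
y = cached_int(10 ** 9)
print(x is y)                    # True, but only because we pay for the table
print(sys.getsizeof(int_cache))  # extra memory spent on the lookup structure alone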

Because strings need more memory, the relative (memory) cost of the lookup data structure isn't that high. But it doesn't make any sense to intern a 1000-character string, because the probability that a randomly created string has exactly the same characters is almost 0!

On the other hand, if for example a hash table is used as the lookup structure, calculating the hash takes O(n) (n being the number of characters), which probably won't pay off for large strings.

Thus, Python makes a trade-off which works pretty well in most scenarios, but it cannot be perfect in some special cases. Yet for those special scenarios you can optimize by hand using sys.intern().
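For example, the hand-rolled route_cache from the question could be replaced by sys.intern(); the sketch below (read_as_dicts_interned is a made-up name) assumes the same ctabus.csv file as above:

import csv
import sys

def read_as_dicts_interned(filename):
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            records.append({
                    'route': sys.intern(row[0]),  # one shared object per route name
                    'date': row[1],
                    'daytype': row[2],
                    'rides': int(row[3])})
    return records

rows = read_as_dicts_interned('ctabus.csv')
print(len({id(row['route']) for row in rows}))  # expected: 185 distinct objects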


Note: Having the same id doesn't mean being the same object if the lifetimes of the two objects don't overlap, so your reasoning in the question isn't entirely watertight, but this is of no consequence in this special case.
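A tiny illustration of that caveat (CPython behavior, not a language guarantee):

print(id([1, 2, 3]) == id([4, 5, 6]))  # often True: the first list is already freed
                                       # when the second one is allocated

x = [1, 2, 3]
y = [4, 5, 6]
print(id(x) == id(y))                  # False: both objects are alive at the same time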
