Why does the id of a pandas DataFrame cell change with each execution?


Question

I ran into this problem while trying to verify some properties of a DataFrame view.

Suppose I have a DataFrame defined as df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3)) and a view of this DataFrame defined as df1 = df.iloc[:3, :]. We now have the following two DataFrames:

print(df)
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17

print(df1)
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8

Now I want to print the id of a particular cell in each of these two DataFrames:

print(id(df.loc[0, 'a']))
print(id(df1.loc[0, 'a']))

and my output is:

140114943491408
140114943491408

The weird thing is that if I run those two print lines again, the ids change as well:

140114943491480
140114943491480

I should emphasize that I did not re-run the code that defines df when I ran those two print lines, so df and df1 were not redefined. In my view, the memory address of each element in the DataFrame should then be fixed, so how can the output change?

Something even stranger happens when I keep running those two print lines. In rare cases, the two ids are not even equal to each other:

140114943181088
140114943181112

But if I evaluate id(df.loc[0, 'a']) == id(df1.loc[0, 'a']) at the same time, Python still outputs True. I understand that since df1 is a view of df, their cells should share the same memory, so how can their ids occasionally differ?

These strange behaviors leave me completely lost. Could anyone explain them? Are they due to the characteristics of DataFrames or of the id function in Python? Thanks!

FYI, I am using Python 3.5.2.

Answer

You are not getting the id of a "cell"; you are getting the id of the object returned by the .loc accessor, which is a boxed version of the underlying data.

So:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3))
>>> df1 = df.iloc[:3, :]
>>> df.dtypes
a    int64
b    int64
c    int64
dtype: object
>>> df1.dtypes
a    int64
b    int64
c    int64
dtype: object

But since everything in Python is an object, .loc must return an object:

>>> x = df.loc[0, 'a']
>>> x
0
>>> type(x)
<class 'numpy.int64'>
>>> isinstance(x, object)
True

However, the actual underlying buffer is a primitive array of fixed-size C 64-bit signed integers. They are not Python objects; they are "boxed", to borrow a term from other languages that mix primitive types with objects.
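To make the boxing concrete, here is a minimal sketch. It assumes an explicit dtype=np.int64 (not in the original, added so the result does not depend on the platform default) and uses to_numpy() to look at the raw column buffer. Each .loc access should wrap the same stored value in a fresh numpy.int64 object:

```python
import numpy as np
import pandas as pd

# dtype=np.int64 is an assumption added here so the example behaves the
# same on platforms whose default integer type is 32-bit.
df = pd.DataFrame(columns=list('abc'),
                  data=np.arange(18, dtype=np.int64).reshape(6, 3))

raw = df['a'].to_numpy()   # the underlying column data as a NumPy array
print(raw.dtype)           # the buffer holds plain 64-bit integers

x = df.loc[0, 'a']         # boxed into a new Python-level object
x2 = df.loc[0, 'a']        # boxed again into another object
print(type(x))
print(x == x2)             # same value
print(x is x2)             # two distinct objects, both alive right now
```

The two boxes compare equal because they are built from the same stored integer, yet they are separate objects, so id tells you about the box and not about the cell.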

Now, the phenomenon you are seeing, with all objects having the same id:

>>> id(df.loc[0, 'a']), id(df.loc[0, 'a'])
(4539673432, 4539673432)
>>> id(df.loc[0, 'a']), id(df.loc[0, 'a']), id(df1.loc[0,'a'])
(4539673432, 4539673432, 4539673432)

This occurs because in Python, an object is free to reuse the memory address of a recently reclaimed object (in CPython, id is the object's memory address). When you build your tuple of ids, the object returned by loc exists only long enough to be passed to and processed by the first call to id. By the second call to loc, that object has already been deallocated, and the new object simply reuses the same memory. You can see the same behavior with any Python object, such as a list:
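The lifetime argument can be observed directly with a small sketch. Temp is a hypothetical helper class (not part of pandas or the original answer) that records when it is created and destroyed. The ordering shown relies on CPython's reference counting, so treat it as an illustration of CPython behavior, not a language guarantee:

```python
class Temp:
    """Hypothetical helper that logs its own creation and destruction."""

    def __init__(self, name, events):
        self.name = name
        self.events = events
        events.append("create " + name)

    def __del__(self):
        self.events.append("destroy " + self.name)


events = []
# Both objects are temporaries: nothing references them once id() returns,
# so on CPython the first is destroyed before the second is created.
ids = (id(Temp("first", events)), id(Temp("second", events)))
print(events)
print(ids[0] == ids[1])   # often True on CPython, but not guaranteed

# Holding references keeps both objects alive, so their ids must differ.
a = Temp("a", [])
b = Temp("b", [])
print(id(a) != id(b))
```

This is also why id(df.loc[0, 'a']) == id(df1.loc[0, 'a']) tends to evaluate to True: the first temporary is gone before the second is created, and the second often lands at the same address.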

>>> id([]), id([])
(4545276872, 4545276872)

Fundamentally, ids are only guaranteed to be unique for the lifetime of the object. Read more about this phenomenon here. But note that in the following case, the ids will always be different:

>>> x = df.loc[0, 'a']
>>> x2 = df.loc[0, 'a']
>>> id(x), id(x2)
(4539673432, 4539673408)

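This is because keeping references in x and x2 keeps both objects alive at the same time, so they cannot share a memory address.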

Note that for many immutable objects, the interpreter is free to optimize and return the exact same object. In CPython, this is the case with "small ints", through the so-called small-int cache:

>>> x = 2
>>> y = 2
>>> id(x), id(y)
(4304820368, 4304820368)

But this is an implementation detail and should not be relied upon.
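A short sketch of why identity should not be used to compare values. The integers are built at runtime with int(...) so that the compiler cannot merge equal literals into one constant. The results of the is checks depend on the interpreter and are printed without an expected value, while == is always reliable:

```python
# Build the integers at runtime so literal constant merging is not involved.
x = int("2")
y = int("2")
big1 = int("100000")
big2 = int("100000")

print(x is y)         # implementation-dependent (CPython caches small ints)
print(big1 is big2)   # implementation-dependent
print(x == y)         # value comparison: always the right tool
print(big1 == big2)
```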

If you want to prove to yourself that your DataFrames share the same underlying buffer, just mutate one of them and you will see the same change reflected across views:

>>> df
    a   b   c
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df.loc[0, 'a'] = 99
>>> df
    a   b   c
0  99   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
>>> df1
    a  b  c
0  99  1  2
1   3  4  5
2   6  7  8
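A non-mutating way to check the same thing, sketched with np.shares_memory on the column arrays returned by to_numpy(). Whether a later write to df shows up in df1 depends on the pandas version: with Copy-on-Write enabled (as in newer pandas versions), a write triggers a copy and df1 keeps its old values, so this sketch only inspects the buffers and does not mutate anything:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=list('abc'), data=np.arange(18).reshape(6, 3))
df1 = df.iloc[:3, :]

# Compare the raw buffers behind column 'a' of the frame and of the slice.
shares = np.shares_memory(df['a'].to_numpy(), df1['a'].to_numpy())
print(shares)
```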
