Why can itertools.groupby group the NaNs in lists but not in numpy arrays
Question
I'm having a hard time debugging a problem in which the float nan values in a list and the nan values in a numpy.array are handled differently when they are passed to itertools.groupby:
Given the following list and array:
from itertools import groupby
import numpy as np
lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)
When I iterate over the list, the contiguous nans are grouped:
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
...
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>
However, if I use the array, it puts successive nans in different groups:
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
...
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
The same happens even if I convert the array back to a list:
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
...
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
I'm using:
numpy 1.11.3
python 3.5
I know that generally nan != nan, so why do these operations give different results? And how is it possible that groupby can group nans at all?
Answer
Python lists are just arrays of pointers to objects in memory. In particular, lst holds pointers to the object np.nan:
>>> [id(x) for x in lst]
[139832272211880, # nan
139832272211880, # nan
139832272211880, # nan
139832133974296,
139832270325408,
139832133974296,
139832133974464,
139832133974320,
139832133974296,
139832133974440,
139832272211880, # nan
139832133974296]
(np.nan is at 139832272211880 on my computer.)
On the other hand, NumPy arrays are just contiguous regions of memory; they are regions of bits and bytes that NumPy interprets as a sequence of values (floats, ints, etc.).
The trouble is that when you iterate over a NumPy array holding floating-point values (in a for loop or inside groupby), each value has to be boxed into a proper Python object. A brand-new Python object is created in memory for every single value in the array as the iteration proceeds.
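As a minimal sketch of this boxing behaviour (the variable names are illustrative and not part of the original question), indexing the same array slot twice yields two separate scalar objects, whereas the list simply hands back the one np.nan object it points to:

```python
import numpy as np

lst = [np.nan, np.nan, 0.16]
arr = np.array(lst)

# Each access to the array boxes the raw bytes into a new scalar object.
first = arr[0]
second = arr[0]
array_same_object = first is second

# The list stores pointers, and both slots point at the np.nan object.
list_same_object = lst[0] is lst[1]

print(type(first).__name__, array_same_object, list_same_object)
# float64 False True
```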
For example, you can see that a distinct object is created for each nan value when .tolist() is called:
>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
4355054640, # nan
4355054664, # nan
4355054688,
4355054712,
4355054736,
4355054760,
4355054784,
4355054808,
4355054832,
4355054856, # nan
4355054880]
itertools.groupby is able to group on np.nan for the Python list because, when it compares Python objects, it checks for identity first. Because these pointers to nan all point at the same np.nan object, grouping is possible.
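The identity-first comparison can be seen without NumPy at all. The following is a small sketch, assuming CPython (where each float("nan") call returns a new object); the list names are made up for illustration:

```python
from itertools import groupby

nan = float("nan")

# One object referenced three times versus three separate nan objects.
same_object = [nan, nan, nan]
distinct_objects = [float("nan"), float("nan"), float("nan")]

equal_to_itself = nan == nan      # plain equality: nan is never equal to nan
found_in_list = nan in [nan]      # containment checks identity before equality

same_groups = [len(list(group)) for _, group in groupby(same_object)]
distinct_groups = [len(list(group)) for _, group in groupby(distinct_objects)]

print(equal_to_itself, found_in_list)  # False True
print(same_groups, distinct_groups)    # [3] [1, 1, 1]
```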
However, iterating over the NumPy array (or over arr.tolist()) yields a distinct object for each nan, so this initial identity check cannot succeed. Python then falls back to checking for equality and, as you say, nan != nan.
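If the goal is to have contiguous nans grouped regardless of object identity, one possible approach (not part of the original answer, and only a sketch) is to pass groupby a key function that maps every nan to a single sentinel. The names NAN_KEY and nan_aware_key below are hypothetical helpers introduced for illustration:

```python
from itertools import groupby
import math

import numpy as np

arr = np.array([np.nan, np.nan, np.nan, 0.16, 1, 0.16, np.nan, 0.16])

# Hypothetical sentinel: every nan maps to this one object, so groupby's
# comparison of consecutive keys succeeds for consecutive nans.
NAN_KEY = object()


def nan_aware_key(value):
    """Return the shared sentinel for nan, otherwise the value itself."""
    return NAN_KEY if math.isnan(value) else value


# Record, for each group, whether it is a nan group and how long it is.
groups = [
    (key is NAN_KEY, len(list(group)))
    for key, group in groupby(arr, key=nan_aware_key)
]
print(groups)
# [(True, 3), (False, 1), (False, 1), (False, 1), (True, 1), (False, 1)]
```

The sentinel keeps the nan groups distinguishable from real values while leaving the grouping of ordinary numbers unchanged.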