Why can itertools.groupby group the NaNs in lists but not in numpy arrays

Question

I'm having a difficult time debugging a problem in which a float nan in a list and a nan in a numpy.array are handled differently when they are used in itertools.groupby:

Given the following list and array:

from itertools import groupby
import numpy as np

lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)

When I iterate over the list, the contiguous nans are grouped:

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>

However, if I use the array, it puts successive nans in different groups:

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>

The same happens even if I convert the array back to a list:

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>

I'm using:

numpy 1.11.3
python 3.5

I know that generally nan != nan, so why do these operations give different results? And how is it possible that groupby can group nans at all?

Recommended answer

Python lists are just arrays of pointers to objects in memory. In particular, lst holds several pointers to the same np.nan object:

>>> [id(x) for x in lst]
[139832272211880, # nan
 139832272211880, # nan
 139832272211880, # nan
 139832133974296,
 139832270325408,
 139832133974296,
 139832133974464,
 139832133974320,
 139832133974296,
 139832133974440,
 139832272211880, # nan
 139832133974296]

(np.nan is at 139832272211880 on my computer.)

On the other hand, NumPy arrays are just contiguous regions of memory; they are regions of bits and bytes that NumPy interprets as a sequence of values (floats, ints, etc.).

The trouble is that when you ask Python to iterate over a NumPy array holding floating-point values (in a for loop or inside groupby), those bytes have to be boxed into proper Python objects. A brand new Python object is created in memory for each single value in the array as the iteration proceeds.
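As a minimal sketch of this boxing behavior (the variable names are illustrative and not from the original post, and the identity results describe CPython with NumPy as I understand them), reading the same array slot twice yields two different objects, while a list simply hands back the reference it stored:

```python
import numpy as np

arr = np.array([np.nan, np.nan, 0.16])

# Each read boxes the raw 8 bytes into a new NumPy scalar object.
first = arr[0]
again = arr[0]
same_slot_is_same_object = first is again

# Iterating boxes every element separately as well.
boxed = list(arr)
array_nans_are_distinct = boxed[0] is not boxed[1]

# A plain list stores references, and np.nan is one shared float object.
lst = [np.nan, np.nan, 0.16]
list_nans_are_same_object = lst[0] is lst[1]

print(same_slot_is_same_object, array_nans_are_distinct, list_nans_are_same_object)
```

Both names are kept alive here on purpose, so the comparison does not depend on a freed object's address being reused.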

For example, you can see that a distinct object is created for each nan value when .tolist() is called:

>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
 4355054640, # nan
 4355054664, # nan
 4355054688,
 4355054712,
 4355054736,
 4355054760,
 4355054784,
 4355054808,
 4355054832,
 4355054856, # nan
 4355054880]

itertools.groupby is able to group on np.nan for the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.

However, iterating over the NumPy array never lets this initial identity check succeed, because every element arrives as a new object. Python therefore falls back to checking for equality, and nan != nan, as you say.
