Finding starts and stops of blocks of consecutive values in Python/NumPy/pandas


Question

I want to find the start and stop indexes of blocks of identical values in a numpy array or, preferably, a pandas DataFrame (blocks along the column for a 2D array, and along the most quickly varying index for an n-dimensional array). I only look for blocks along a single dimension and don't want to aggregate NaNs across different rows.

Starting from that question (Find large number of consecutive values fulfilling condition in a numpy array), I wrote the following solution, which finds blocks of np.nan in a 2D array:

import numpy as np
a = np.array([
        [1, np.nan, np.nan, 2],
        [np.nan, 1, np.nan, 3], 
        [np.nan, np.nan, np.nan, np.nan]
    ])

nan_mask = np.isnan(a)
start_nans_mask = np.hstack((np.resize(nan_mask[:,0],(a.shape[0],1)),
                             np.logical_and(np.logical_not(nan_mask[:,:-1]), nan_mask[:,1:])
                             ))
stop_nans_mask = np.hstack((np.logical_and(nan_mask[:,:-1], np.logical_not(nan_mask[:,1:])),
                            np.resize(nan_mask[:,-1], (a.shape[0],1))
                            ))

start_row_idx,start_col_idx = np.where(start_nans_mask)
stop_row_idx,stop_col_idx = np.where(stop_nans_mask)

For example, this lets me analyze the distribution of the lengths of patches of missing values before applying DataFrame.fillna:

stop_col_idx - start_col_idx + 1
array([2, 1, 1, 4], dtype=int64)
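
As a small illustration of what "analyzing the distribution" could look like, the sketch below counts how often each block length occurs. It hardcodes the lengths shown above rather than recomputing them, and the variable names are only illustrative.

```python
import numpy as np

# Block lengths copied from the output above (not recomputed here).
lengths = np.array([2, 1, 1, 4])

# np.unique with return_counts=True gives each distinct length
# together with the number of blocks having that length.
values, counts = np.unique(lengths, return_counts=True)
distribution = dict(zip(values.tolist(), counts.tolist()))
print(distribution)  # {1: 2, 2: 1, 4: 1}
```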

Another example and the expected result:

a = np.array([
        [1, np.nan, np.nan, 2],
        [np.nan, 1, np.nan, np.nan], 
        [np.nan, np.nan, np.nan, np.nan]
    ])

array([2, 1, 2, 4], dtype=int64)

and not:

array([2, 1, 6], dtype=int64)
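
To make the difference concrete, here is a minimal sketch comparing the two results on the array above. The helper `run_lengths_1d` is hypothetical (it is not part of the question's code): applied to the flattened mask it merges NaNs across row boundaries, while applied row by row it keeps the blocks separate.

```python
import numpy as np

def run_lengths_1d(mask):
    """Lengths of consecutive True runs in a 1D boolean array (hypothetical helper)."""
    # Pad with False on both sides so runs touching the edges are closed.
    padded = np.concatenate(([False], mask, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # first index of each run
    stops = np.flatnonzero(edges == -1)   # one past the last index of each run
    return stops - starts

a = np.array([
        [1, np.nan, np.nan, 2],
        [np.nan, 1, np.nan, np.nan],
        [np.nan, np.nan, np.nan, np.nan]
    ])
nan_mask = np.isnan(a)

# Flattening lets the trailing NaNs of row 1 merge with row 2 (unwanted).
flattened = run_lengths_1d(nan_mask.ravel())
# Handling each row separately keeps blocks from crossing rows (wanted).
per_row = np.concatenate([run_lengths_1d(row) for row in nan_mask])

print(flattened)  # [2 1 6]
print(per_row)    # [2 1 2 4]
```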

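My questions are:

  • Is there a way to optimize my solution (finding starts and stops in a single pass of mask/where operations)?
  • Is there a more optimized solution in pandas? (i.e. a solution other than just applying mask/where to the DataFrame's values)
  • What happens when the underlying array or DataFrame is too big to fit in memory?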

Recommended answer

Below is a numpy-based implementation for any dimensionality (ndim = 2 or more):

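import numpy as np

def get_nans_blocks_length(a):
    """
    Return the lengths of the blocks of consecutive np.nan along the last axis,
    as a 1D array ordered by block position in `a` (row-major).
    """
    nan_mask = np.isnan(a)
    # A block starts at the first position of the last axis if it is NaN,
    # or wherever a non-NaN value is followed by a NaN.
    start_nans_mask = np.concatenate((np.resize(nan_mask[...,0],a.shape[:-1]+(1,)),
                                 np.logical_and(np.logical_not(nan_mask[...,:-1]), nan_mask[...,1:])
                                 ), axis=a.ndim-1)
    # A block stops wherever a NaN is followed by a non-NaN value,
    # or at the last position of the last axis if it is NaN.
    stop_nans_mask = np.concatenate((np.logical_and(nan_mask[...,:-1], np.logical_not(nan_mask[...,1:])),
                                np.resize(nan_mask[...,-1], a.shape[:-1]+(1,))
                                ), axis=a.ndim-1)

    # np.where scans in row-major order, so starts and stops pair up one to one.
    start_idxs = np.where(start_nans_mask)
    stop_idxs = np.where(stop_nans_mask)
    return stop_idxs[-1] - start_idxs[-1] + 1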

So that:

a = np.array([
        [1, np.nan, np.nan, np.nan],
        [np.nan, 1, np.nan, 2], 
        [np.nan, np.nan, np.nan, np.nan]
    ])
get_nans_blocks_length(a)
array([3, 1, 1, 4], dtype=int64)
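
The function above returns only lengths, while the question also asks for the start and stop indexes. One possible variation, sketched here under the assumption that an inclusive stop index is wanted, pads the mask along the last axis and looks at the sign of its differences. The name `nan_block_bounds` is hypothetical and not part of the original answer.

```python
import numpy as np

def nan_block_bounds(a):
    """Return (leading_indices, starts, stops) of NaN blocks along the last axis.

    Hypothetical helper; `stops` is inclusive.
    """
    mask = np.isnan(a)
    # Pad only the last axis with False so blocks touching an edge are closed.
    pad = [(0, 0)] * (mask.ndim - 1) + [(1, 1)]
    padded = np.pad(mask, pad, mode="constant", constant_values=False)
    # +1 marks a transition into a NaN block, -1 a transition out of it.
    edges = np.diff(padded.astype(np.int8), axis=-1)
    start_idx = np.nonzero(edges == 1)
    stop_idx = np.nonzero(edges == -1)
    # The -1 edge sits one past the last NaN, hence the subtraction.
    return start_idx[:-1], start_idx[-1], stop_idx[-1] - 1

a = np.array([
        [1, np.nan, np.nan, np.nan],
        [np.nan, 1, np.nan, 2],
        [np.nan, np.nan, np.nan, np.nan]
    ])
(rows,), starts, stops = nan_block_bounds(a)
print(rows)    # [0 1 1 2]
print(starts)  # [1 0 2 0]
print(stops)   # [3 0 2 3]
```

Here `stops - starts + 1` reproduces the lengths returned by get_nans_blocks_length for the same array.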

And for a 3D array:

a = np.array([
        [[1, np.nan], [np.nan, np.nan]],
        [[np.nan, 1], [np.nan, 2]], 
        [[np.nan, np.nan], [np.nan, np.nan]]
    ])
get_nans_blocks_length(a)
array([1, 2, 1, 1, 2, 2], dtype=int64)
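
The answer above does not address the pandas part of the question. A commonly used pandas idiom for run lengths combines shift, cumsum and groupby; the sketch below applies it to one column at a time. It is only an illustration, not a benchmarked alternative: the helper name `nan_run_lengths` and the sample DataFrame are made up, and runs are measured down each column (along the index), so a DataFrame would need transposing to measure along rows instead.

```python
import numpy as np
import pandas as pd

def nan_run_lengths(s):
    """Lengths of consecutive-NaN runs in a Series (hypothetical helper)."""
    is_nan = s.isna()
    # A new run id starts whenever the NaN flag differs from the previous row.
    run_id = (is_nan != is_nan.shift()).cumsum()
    # Keep only the NaN rows and count how many fall into each run.
    return is_nan[is_nan].groupby(run_id[is_nan]).size().to_numpy()

# Made-up sample data for illustration.
df = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 2.0],
    "y": [np.nan, 1.0, np.nan, np.nan],
})
lengths = {col: nan_run_lengths(df[col]) for col in df.columns}
print(lengths["x"])  # [2]
print(lengths["y"])  # [1 2]
```

Because each column is handled independently, NaNs in different columns are never merged into one block.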
