How can I count the number of consecutive TRUEs in a DataFrame?


Question

I have a dataset made up of True and False values.

Sample Table:
       A      B      C
0  False   True  False
1  False  False  False
2   True   True  False
3   True   True   True
4  False   True  False
5   True   True   True
6   True  False  False
7   True  False   True
8  False   True   True
9   True  False  False

I want to count the consecutive True values in every column. If a column contains more than one run of consecutive True values, I want the length of the longest run.

For the table above, I would get:

length = [3, 4, 2]

I found similar threads, but none of them resolved my problem.

Since I have, and will keep getting, many more columns (products), I need to do this for the whole table regardless of the column names, and get an array as the result.

If possible, I'd also like to know the index of the first True of the longest sequence, i.e. where the longest run of True values starts. For this table the result would be:

index = [5, 2, 7]

Recommended answer

We would basically leverage two ideas: catching the shifts (the False/True transitions) in the compared array, and offsetting each column's results so that the whole computation can be vectorized.
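
To make the first idea concrete, here is a minimal single-column sketch of "catching shifts". It uses column A of the question's sample table; the variable names (`col`, `padded`, `steps`) are chosen for illustration only and are not part of the answer's function.

```python
import numpy as np

# Column A from the question's sample table.
col = np.array([False, False, True, True, False, True, True, True, False, True])

# Pad with False on both ends so every run of True has a start and an end.
padded = np.concatenate(([False], col, [False]))

# True wherever the value changes between neighbouring elements.
steps = padded[1:] != padded[:-1]
idx = np.flatnonzero(steps)

starts = idx[::2]            # even hits: positions where a run of True begins
lens = idx[1::2] - starts    # odd hits: one past the end of each run

print(starts)  # [2 5 9]
print(lens)    # [2 3 1]
```

The longest run has length 3 and starts at index 5, which matches the expected result for column A.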

With that in mind, here's one way to achieve the desired results -

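import numpy as np

def maxisland_start_len_mask(a, fillna_index=-1, fillna_len=0):
    # a is a 2D boolean array (e.g. df.to_numpy()); islands are found per column
    a = np.asarray(a, dtype=bool)

    # Pad a row of False above and below so every island has a start and an end
    pad = np.zeros(a.shape[1], dtype=bool)
    mask = np.vstack((pad, a, pad))

    # Transitions, flattened column by column: even hits are starts, odd hits are ends
    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2

    # No True anywhere: return the fill values (island_lens.max() fails on an empty array)
    if island_lens.size == 0:
        return (np.full(a.shape[1], fillna_index, dtype=int),
                np.full(a.shape[1], fillna_len, dtype=int))

    # Column ID of each island, scaled so that sorting groups islands by column first
    bins = np.repeat(np.arange(a.shape[1]), n_islands_percol)
    scale = island_lens.max()+1

    # Sort by (column, length); the last entry of each column group is its longest island
    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0, n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]

    # Undo the per-column offset of (number of rows + 1) to get row indices
    max_island_percol_start = max_island_starts % (a.shape[0]+1)

    # Columns without any True keep the fill values
    valid = n_islands_percol != 0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)

    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid, max_island_percol_start, fillna_index)
    return out_index, out_len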
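
The second idea, offsetting each column's results, may be easier to follow on a tiny array. The sketch below is an illustration only, using a made-up 3x2 input. It shows that flattening the transposed transition mask gives positions of the form `column * (n_rows + 1) + row`, which is why the function can recover row indices with a modulo.

```python
import numpy as np

# Small made-up input: column 0 is T,T,F and column 1 is F,T,T.
a = np.array([[True, False],
              [True, True],
              [False, True]])
n_rows, n_cols = a.shape

# Same padding and transition detection as in the function above.
pad = np.zeros(n_cols, dtype=bool)
mask = np.vstack((pad, a, pad))
mask_step = mask[1:] != mask[:-1]          # shape (n_rows + 1, n_cols)

# Flattening the transpose walks the transitions column by column.
flat = np.flatnonzero(mask_step.T)

# Each flat position encodes column * (n_rows + 1) + row.
cols, rows = np.divmod(flat, n_rows + 1)

print(flat)  # [0 2 5 7]
print(cols)  # [0 0 1 1]
print(rows)  # [0 2 1 3]
```

Reading the pairs per column: column 0 has a run from row 0 up to (not including) row 2, and column 1 has a run from row 1 up to row 3.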

Sample run -

# Generic case to handle all 0s columns
In [112]: a
Out[112]: 
array([[False, False, False],
       [False, False, False],
       [ True, False, False],
       [ True, False,  True],
       [False, False, False],
       [ True, False,  True],
       [ True, False, False],
       [ True, False,  True],
       [False, False,  True],
       [ True, False, False]])

In [117]: starts,lens = maxisland_start_len_mask(a, fillna_index=-1, fillna_len=0)

In [118]: starts
Out[118]: array([ 5, -1,  7])

In [119]: lens
Out[119]: array([3, 0, 2])
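
As a cross-check, the same numbers can be obtained with a simpler, non-vectorized pandas loop over columns. This is a different technique from the answer's function (a per-column `groupby` on run IDs) and is only a sketch. The helper name `longest_true_run` is hypothetical, and it assumes boolean columns and a default integer index. When several runs tie for the maximum length, it returns the first one.

```python
import pandas as pd

# The sample table from the question.
df = pd.DataFrame({
    "A": [False, False, True, True, False, True, True, True, False, True],
    "B": [True, False, True, True, True, True, False, False, True, False],
    "C": [False, False, False, True, False, True, False, True, True, False],
})

def longest_true_run(s):
    """Return (start, length) of the longest run of True in a boolean Series.

    Hypothetical helper for illustration; returns (-1, 0) if there is no True.
    """
    if not s.any():
        return -1, 0
    # A new run ID starts every time the value changes.
    run_id = s.astype(int).diff().ne(0).cumsum()
    true_ids = run_id[s]                          # run IDs of the True rows only
    runs = true_ids.groupby(true_ids).size()      # length of each True run
    best = runs.idxmax()                          # first run with the maximal length
    start = true_ids.index[(true_ids == best).to_numpy()][0]
    return int(start), int(runs.max())

result = [longest_true_run(df[c]) for c in df.columns]
index = [r[0] for r in result]
length = [r[1] for r in result]

print(index)   # [5, 2, 7]
print(length)  # [3, 4, 2]
```

This loops over columns in Python, so it will likely be slower than the vectorized function on wide tables, but it is handy for validating results on small inputs.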
