How to count longest uninterrupted sequence in pandas


Question

Say I have a pd.Series like the following:

import pandas as pd
s = pd.Series([False, True, False, True, True, True, False, False])

0    False
1     True
2    False
3     True
4     True
5     True
6    False
7    False
dtype: bool

I want to know the length of the longest run of consecutive True values. In this example, it is 3.

I tried a naive loop:

s_list = s.tolist()
count = 0
max_count = 0
for item in s_list:
    if item:
        count +=1
    else:
        if count>max_count:
            max_count = count
        count = 0
print(max_count)

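It prints 3 here, but for a Series that is all True it prints 0. That is because max_count is only updated when a False is encountered, so a run that reaches the end of the Series is never counted.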
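A minimal sketch of one way to repair the loop, assuming the only defect is that a run reaching the end of the Series is never compared against max_count. The function name longest_true_run is illustrative and not from the original post.

```python
import pandas as pd


def longest_true_run(s):
    """Return the length of the longest run of consecutive True values."""
    count = 0
    max_count = 0
    for item in s.tolist():
        if item:
            count += 1
            # Update on every True so a run ending at the last element counts.
            max_count = max(max_count, count)
        else:
            count = 0
    return max_count


s = pd.Series([False, True, False, True, True, True, False, False])
print(longest_true_run(s))                      # 3
print(longest_true_run(pd.Series([True] * 4)))  # 4
```

Moving the comparison into the True branch costs one extra max call per True element, but removes the need for a separate check after the loop.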

Recommended answer

Option 1
Use the series itself to mask the cumulative sum of its negation, then use value_counts.

(~s).cumsum()[s].value_counts().max()

3

Explanation

  • (~s).cumsum() is a pretty standard way to produce distinct True/False groups: every False increments the counter, so each run of True values shares a label with the False that precedes it.

0    1
1    1
2    2
3    2
4    2
5    2
6    3
7    4
dtype: int64

  • But you can see that the group we care about is represented by the 2s, and there are four of them, not three. That's because the group is started by the False at index 2 (which becomes True under ~s). Therefore, we mask this cumulative sum with the boolean series we started with.

    (~s).cumsum()[s]
    
    1    1
    3    2
    4    2
    5    2
    dtype: int64
    

  • Now exactly three 2s remain, and we just need a way to count them. I used value_counts and max.
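The steps above can be traced one at a time. This is a sketch of the same expression split into named intermediates. The variable names and the guard for a Series with no True values are additions for illustration, not part of the original answer.

```python
import pandas as pd

s = pd.Series([False, True, False, True, True, True, False, False])

# Step 1: every False bumps the counter, so each run of True values
# shares a label with the False that precedes it.
groups = (~s).cumsum()

# Step 2: keep only the positions where s is True.
true_groups = groups[s]

# Step 3: count how often each label occurs; the largest count is the
# length of the longest run.
run_lengths = true_groups.value_counts()

# Assumption added here: report 0 when the Series has no True at all.
longest = int(run_lengths.max()) if s.any() else 0

print(groups.tolist())       # [1, 1, 2, 2, 2, 2, 3, 4]
print(true_groups.tolist())  # [1, 2, 2, 2]
print(longest)               # 3
```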


    Option 2
    Use factorize and bincount

    import numpy as np
    a = s.values
    b = pd.factorize((~a).cumsum())[0]
    np.bincount(b[a]).max()
    
    3
    

    Explanation
    This is similar to the explanation for option 1. The main difference is in how I find the max. I use pd.factorize to encode the values as integers ranging from 0 to one less than the number of unique values. Given the actual values in (~a).cumsum(), this step isn't strictly needed. I used it because it's a general-purpose tool that works on arbitrary group names.

    After pd.factorize, I pass those integer codes to np.bincount, which counts how many times each integer occurs. Then I take the maximum.
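To illustrate why factorize is useful when group names are arbitrary, here is a small sketch using made-up string labels (the labels array is hypothetical and not from the original answer). np.bincount only accepts non-negative integers, so the labels need to be encoded first.

```python
import numpy as np
import pandas as pd

# Hypothetical string labels standing in for arbitrary group names.
labels = np.array(["x", "y", "y", "y", "z", "x"], dtype=object)

# factorize maps each distinct label to an integer code, numbered in
# order of first appearance.
codes, uniques = pd.factorize(labels)

# bincount tallies how many times each integer code occurs.
counts = np.bincount(codes)

print(codes.tolist())   # [0, 1, 1, 1, 2, 0]
print(list(uniques))    # ['x', 'y', 'z']
print(counts.tolist())  # [2, 3, 1]
print(counts.max())     # 3
```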

    Option 3
    As noted in the explanation of option 2, this also works without factorize:

    a = s.values
    np.bincount((~a).cumsum()[a]).max()
    
    3
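As a sanity check, the three options can be wrapped in functions and compared on random input. This is only a sketch: the function names, the seed, and the test sizes are arbitrary choices made here. It also forces at least one True into each Series, because np.bincount on an empty selection returns an empty array and calling max on that raises an error.

```python
import numpy as np
import pandas as pd


def option1(s):
    return int((~s).cumsum()[s].value_counts().max())


def option2(s):
    a = s.values
    b = pd.factorize((~a).cumsum())[0]
    return int(np.bincount(b[a]).max())


def option3(s):
    a = s.values
    return int(np.bincount((~a).cumsum()[a]).max())


rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
for _ in range(100):
    values = rng.random(20) < 0.6
    values[0] = True  # make sure at least one True is present
    trial = pd.Series(values)
    assert option1(trial) == option2(trial) == option3(trial)

# The all-True case that defeated the loop in the question.
print(option3(pd.Series([True] * 5)))  # 5
```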
    
