How to count the longest uninterrupted sequence in pandas
Question
Let's say I have a pd.Series like the one below:
import pandas as pd
s = pd.Series([False, True, False, True, True, True, False, False])
0 False
1 True
2 False
3 True
4 True
5 True
6 False
7 False
dtype: bool
I want to know the length of the longest run of consecutive True values. In this example, it is 3.
I tried it in a naive way:
s_list = s.tolist()
count = 0
max_count = 0
for item in s_list:
    if item:
        count += 1
    else:
        if count > max_count:
            max_count = count
        count = 0
print(max_count)
It prints 3, but for a Series of all True it prints 0, because max_count is only updated when a False is encountered.
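For comparison, here is a minimal sketch of how the loop could be adjusted so that a run reaching the end of the data is still counted. The function name longest_true_run is made up for illustration and is not part of the original question.

```python
def longest_true_run(values):
    """Return the length of the longest consecutive run of truthy items."""
    count = 0
    max_count = 0
    for item in values:
        if item:
            count += 1
            # Update the maximum on every True, so a run that reaches the
            # end of the sequence is not lost.
            max_count = max(max_count, count)
        else:
            count = 0
    return max_count


print(longest_true_run([False, True, False, True, True, True, False, False]))
print(longest_true_run([True, True, True, True]))
```

With this change the all-True case no longer depends on a trailing False to record the run.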
Recommended answer
Option 1
Use the series itself to mask the cumulative sum of its negation. Then use value_counts.
(~s).cumsum()[s].value_counts().max()
3
Explanation
(~s).cumsum() is a pretty standard way to produce distinct True/False groups:
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 4
dtype: int64
But you can see that the group we care about is represented by the 2s, and there are four of them. That's because the group is started by the False that precedes the run (which becomes True under ~s) and so is counted along with it. Therefore, we mask this cumulative sum with the boolean mask we started with.
(~s).cumsum()[s]
1 1
3 2
4 2
5 2
dtype: int64
Now we see the three 2s pop out, and we just have to use a method to count them. I used value_counts and max.
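The steps above can be wrapped into a small helper. This is only a sketch: the name longest_true_run_pd is hypothetical, it assumes s has boolean dtype, and the s.any() guard is an addition for the case where the Series contains no True at all (the one-liner has nothing to count there).

```python
import pandas as pd


def longest_true_run_pd(s):
    """Longest run of True in a boolean Series (hypothetical helper)."""
    # Added guard: with no True values the masked Series would be empty.
    if not s.any():
        return 0
    groups = (~s).cumsum()                   # group id goes up at every False
    run_lengths = groups[s].value_counts()   # keep True rows, count per group
    return int(run_lengths.max())


s = pd.Series([False, True, False, True, True, True, False, False])
print(longest_true_run_pd(s))
print(longest_true_run_pd(pd.Series([True] * 5)))
```

For an all-True Series every row lands in group 0, so the helper returns the full length rather than 0.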
Option 2
Use factorize and bincount
import numpy as np
a = s.values
b = pd.factorize((~a).cumsum())[0]
np.bincount(b[a]).max()
3
Explanation
This is similar to the explanation for Option 1. The main difference is in how I found the max. I use pd.factorize to tokenize the values into integers ranging from 0 to one less than the number of unique values. Given the actual values we had in (~a).cumsum(), we didn't strictly need this part. I used it because it's a general-purpose tool that can be used on arbitrary group names.
After pd.factorize, I pass those integer codes to np.bincount, which counts how many times each integer occurs. Then take the maximum.
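To make the "arbitrary group names" point concrete, here is a small sketch on made-up string labels (the labels array is illustrative only and not from the answer). pd.factorize assigns codes in order of first appearance, and np.bincount then counts each code.

```python
import numpy as np
import pandas as pd

# Made-up group names, purely for illustration.
labels = np.array(["x", "y", "x", "z", "x"])

codes, uniques = pd.factorize(labels)
print(codes.tolist())    # [0, 1, 0, 2, 0]
print(list(uniques))     # ['x', 'y', 'z']

counts = np.bincount(codes)
print(counts.tolist())   # [3, 1, 1]
print(counts.max())      # 3
```

np.bincount only accepts non-negative integers, which is why string labels need the factorize step first while the integer output of cumsum does not.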
Option 3
As stated in the explanation of Option 2, this also works:
a = s.values
np.bincount((~a).cumsum()[a]).max()
3
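As a quick sanity check, the three options can be compared against a brute-force reference on random data. This is a sketch under assumptions: the reference function built on itertools.groupby, the seed, and the array size are all arbitrary choices made here, not part of the original answer. Inputs without any True are skipped because the one-liners would have nothing to count.

```python
from itertools import groupby

import numpy as np
import pandas as pd


def reference(values):
    # Brute-force reference: length of the longest run of True values.
    return max(
        (sum(1 for _ in grp) for key, grp in groupby(values) if key),
        default=0,
    )


rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
for _ in range(100):
    a = rng.random(20) < 0.6    # random boolean array
    if not a.any():
        continue                # the one-liners need at least one True
    s = pd.Series(a)
    opt1 = (~s).cumsum()[s].value_counts().max()
    opt2 = np.bincount(pd.factorize((~a).cumsum())[0][a]).max()
    opt3 = np.bincount((~a).cumsum()[a]).max()
    assert opt1 == opt2 == opt3 == reference(a.tolist())

print("all options agree")
```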