Count consecutive ones in a dataframe and get indices where this occurs
Question
I have a pandas.DataFrame with integer column names that contains zeroes and ones. An example of the input:
    12  13  14  15
1    0   0   1   0
2    0   0   1   1
3    1   0   0   1
4    1   1   0   1
5    1   1   1   0
6    0   0   1   0
7    0   0   1   1
8    1   1   0   1
9    0   0   1   1
10   0   0   1   1
11   1   1   0   1
12   1   1   1   1
13   1   1   1   1
14   1   0   1   1
15   0   0   1   1
I need to find every run of consecutive ones whose length (sum) is >= 2, going through the columns one at a time, and also return the indices (start, end) where each run occurs.
The preferred output would be a 3D DataFrame, where the subcolumns "count" and "indices" sit under the integer column names from the input.
An example of the output:
12               13               14               15
count  indices   count  indices   count  indices   count  indices
3      (3,5)     2      (4,5)     2      (1,2)     3      (2,4)
4      (11,14)   3      (11,13)   3      (5,7)     9      (7,15)
                                  2      (9,10)
                                  4      (12,15)
I suppose it should be solved with itertools.groupby, but I still can't figure out how to apply it to a problem like this, where both the groupby results and their indices need to be extracted.
Recommended answer
Here is one way to calculate the desired run lengths:
Code:
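import numpy as np
import pandas as pd

def min_run_length(series):
    # Pad with a zero on each side so runs touching either edge are detected.
    terminal = pd.Series([0])
    diffs = pd.concat([terminal, series, terminal]).diff()
    # +1 marks the first row of a run, -1 marks the row just past its end.
    # The leading pad shifts positions by one, so they are 1-based row numbers.
    starts = np.where(diffs == 1)
    ends = np.where(diffs == -1)
    return [(e - s, (s, e - 1)) for s, e in zip(starts[0], ends[0])
            if e - s >= 2]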
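To see why the padding and diff locate the runs, the intermediate values can be printed for a short series. This is only an illustrative trace; the toy input below is made up and is not part of the original data.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 1, 0, 1])   # hypothetical toy input
terminal = pd.Series([0])

# Same two steps as in min_run_length.
padded = pd.concat([terminal, series, terminal])
diffs = padded.diff()

print(padded.tolist())   # [0, 0, 1, 1, 0, 1, 0]
print(diffs.tolist())    # [nan, 0.0, 1.0, 0.0, -1.0, 1.0, -1.0]

# Positions of +1 are run starts; positions of -1 are one past each run's end.
starts = np.where(diffs == 1)[0]
ends = np.where(diffs == -1)[0]
print(starts.tolist(), ends.tolist())   # [2, 5] [4, 6]
```

The first pair (2, 4) is a run of length 2 covering rows 2 to 3, so it is kept; the second pair (5, 6) has length 1 and is dropped by the `e - s >= 2` filter.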
Test code:
from io import StringIO

df = pd.read_fwf(StringIO(u"""
12  13  14  15
0   0   1   0
0   0   1   1
1   0   0   1
1   1   0   1
1   1   1   0
0   0   1   0
0   0   1   1
1   1   0   1
0   0   1   1
0   0   1   1
1   1   0   1
1   1   1   1
1   1   1   1
1   0   1   1
0   0   1   1"""), header=1)
print(df.dtypes)
indices = {cname: min_run_length(df[cname]) for cname in df.columns}
print(indices)
Results:
{
    u'12': [(3, (3, 5)), (4, (11, 14))],
    u'13': [(2, (4, 5)), (3, (11, 13))],
    u'14': [(2, (1, 2)), (3, (5, 7)), (2, (9, 10)), (4, (12, 15))],
    u'15': [(3, (2, 4)), (9, (7, 15))],
}
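The question asked for a frame with "count" and "indices" subcolumns under each input column. The answer stops at a dict, so the following is only a sketch of one possible reshaping, assuming a DataFrame with two-level (MultiIndex) columns is an acceptable stand-in for the "3D DataFrame". The `indices` dict is copied from the result above; the names `frames` and `result` are illustrative.

```python
import pandas as pd

# Runs as produced by min_run_length, copied from the printed result.
indices = {
    '12': [(3, (3, 5)), (4, (11, 14))],
    '13': [(2, (4, 5)), (3, (11, 13))],
    '14': [(2, (1, 2)), (3, (5, 7)), (2, (9, 10)), (4, (12, 15))],
    '15': [(3, (2, 4)), (9, (7, 15))],
}

# One small frame per input column, with the two requested subcolumns.
frames = {
    cname: pd.DataFrame({
        'count': [count for count, _ in runs],
        'indices': [span for _, span in runs],
    })
    for cname, runs in indices.items()
}

# Concatenating a dict along axis=1 uses its keys as the outer column level;
# columns with fewer runs are padded with NaN.
result = pd.concat(frames, axis=1)
print(result)
```

Because of the NaN padding, the "count" subcolumns of the shorter columns become floats; whether that matters depends on how the table is used afterwards.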
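The question mentions itertools.groupby. The answer above does not use it, but as a rough sketch the same runs could be found that way by tracking the running position while iterating over the groups. The function name `runs_of_ones` and its `min_length` parameter are hypothetical; positions are 1-based and inclusive, to match the output above.

```python
from itertools import groupby

def runs_of_ones(values, min_length=2):
    """Return (count, (start, end)) for each run of ones, 1-based inclusive."""
    runs = []
    position = 1  # 1-based row number where the next group begins
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        if value == 1 and length >= min_length:
            runs.append((length, (position, position + length - 1)))
        position += length
    return runs

# Column 14 from the example input.
column_14 = [1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
print(runs_of_ones(column_14))
# [(2, (1, 2)), (3, (5, 7)), (2, (9, 10)), (4, (12, 15))]
```

For a whole frame this could be applied per column, for example `{cname: runs_of_ones(df[cname]) for cname in df.columns}`.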