Find Max Frequency for every Sequence_ID
Problem description
I have a DataFrame like this:
Time Frq_1 Seq_1 Frq_2 Seq_2 Frq_3 Seq_3
12:43:04 - 30,668 - 30,670 4,620 30,671
12:46:05 - 30,699 - 30,699 3,280 30,700
12:46:17 4,200 30,700 - 30,704 - 30,704
12:46:18 3,060 30,700 4,200 30,700 - 30,700
12:46:18 3,060 30,700 4,200 30,700 - 30,700
12:46:19 3,060 30,700 4,220 30,700 - 30,700
12:46:20 3,060 30,700 4,240 30,700 - 30,700
12:46:37 - 30,698 - 30,699 3,060 30,700
12:46:38 - 30,699 3,060 30,700 4,600 30,700
12:47:19 - 30,668 - 30,669 - 30,669
12:47:20 - 30,667 - 30,667 - 30,668
12:47:20 - 30,667 - 30,667 - 30,668
12:47:21 - 30,667 - 30,667 - 30,668
12:47:21 - 30,665 - 30,665 - 30,665
12:47:22 - 30,665 - 30,665 - 30,665
12:48:35 - 30,688 - 30,690 3,020 30,690
12:49:29 4,160 30,690 - 30,691 - 30,693
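I want to scan the whole DataFrame and find the rows that meet the following conditions:
- the Sequence_ID has a non-empty frequency
- the frequency is the maximum for that Sequence_ID (when the Sequence_ID appears with several non-empty frequencies)
I expect a result like this: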
Time Frequency Sequence_ID
12:43:04 4,620 30,671
12:46:18 4,200 30,700
12:49:29 4,160 30,690
Time = the time of the row that holds that (Sequence_ID, Frequency) pair
Recommended answer
This turned out to be quite involved. Here we go anyway:
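import pandas as pd

# Reshape to one row per (original row, Seq_/Frq_ column pair)
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                          suffix=r'\d+', i='index', j='j')
# '4,620' -> 4.62 and '-' -> NaN
long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',', '.')
                                .replace('-', float('nan')))
long_df.reset_index(drop=True, inplace=True)
# Row label of the highest frequency for each sequence
ix = long_df.groupby('Seq_').Frq_.idxmax()
print(long_df.loc[ix[ix.notna()].values.astype(int)])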
Time Seq_ Frq_
34 12:43:04 30,671 4.62
16 12:49:29 30,690 4.16
42 12:46:38 30,700 4.60
It seems that for the sequence 30,700 the highest frequency is 4.60, not 4.20.
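To check the 4.60 value independently, here is a minimal self-contained sketch. It rebuilds only a subset of the question's rows (an assumption made to keep the example short) and parses them as strings with pd.read_csv. Unlike the code above, it drops the NaN frequencies before calling idxmax; that is a swapped-in step, not the original answer's approach, chosen so that groups with no frequency at all never reach idxmax. The names raw, valid and result are illustrative.

```python
import io
import pandas as pd

# Subset of the rows from the question (assumption: enough to cover all
# three sequences that have a non-empty frequency).
raw = """Time Frq_1 Seq_1 Frq_2 Seq_2 Frq_3 Seq_3
12:43:04 - 30,668 - 30,670 4,620 30,671
12:46:17 4,200 30,700 - 30,704 - 30,704
12:46:20 3,060 30,700 4,240 30,700 - 30,700
12:46:38 - 30,699 3,060 30,700 4,600 30,700
12:48:35 - 30,688 - 30,690 3,020 30,690
12:49:29 4,160 30,690 - 30,691 - 30,693
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", dtype=str)

# Wide -> long: one row per (original row, Seq_/Frq_ pair).
long_df = pd.wide_to_long(df.reset_index(), stubnames=["Seq_", "Frq_"],
                          i="index", j="j", suffix=r"\d+")
long_df = long_df.reset_index(drop=True)

# Treat the comma as a decimal mark, as the answer does; '-' becomes NaN.
long_df["Frq_"] = pd.to_numeric(
    long_df["Frq_"].str.replace(",", ".", regex=False), errors="coerce")

# Keep only rows with a frequency, then pick the max row per sequence.
valid = long_df.dropna(subset=["Frq_"])
result = valid.loc[valid.groupby("Seq_")["Frq_"].idxmax()]
print(result)
```

On this subset, the row selected for 30,700 is the one at 12:46:38 with 4.60, which agrees with the output above.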
The first step is to collapse the DataFrame into three columns: one for Time, one for the sequence and one for the frequency. We can use pd.wide_to_long with the stubnames ['Seq_', 'Frq_']:
long_df = pd.wide_to_long(df.reset_index(), stubnames=['Seq_', 'Frq_'],
                          suffix=r'\d+', i='index', j='j')
print(long_df)
Time Seq_ Frq_
index j
0 1 12:43:04 30,668 -
1 1 12:46:05 30,699 -
2 1 12:46:17 30,700 4,200
3 1 12:46:18 30,700 3,060
4 1 12:46:18 30,700 3,060
5 1 12:46:19 30,700 3,060
6 1 12:46:20 30,700 3,060
7 1 12:46:37 30,698 -
8 1 12:46:38 30,699 -
9 1 12:47:19 30,668 -
10 1 12:47:20 30,667 -
11 1 12:47:20 30,667 -
12 1 12:47:21 30,667 -
13 1 12:47:21 30,665 -
14 1 12:47:22 30,665 -
15 1 12:48:35 30,688 -
16 1 12:49:29 30,690 4,160
...
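How the stubnames and the numeric suffix pair up the columns may be easier to see on a tiny frame. The sketch below uses a hypothetical two-row frame (the values and the names wide and long_toy are made up for illustration): each Frq_N/Seq_N pair becomes one row, and N ends up in the j index level.

```python
import pandas as pd

# Hypothetical two-row frame with two Frq_/Seq_ column pairs.
wide = pd.DataFrame({
    "Time": ["12:00:00", "12:00:01"],
    "Frq_1": ["4,200", "-"],
    "Seq_1": ["30,700", "30,701"],
    "Frq_2": ["-", "3,060"],
    "Seq_2": ["30,702", "30,700"],
})

# 'index' identifies the original row; 'j' receives the suffix (1 or 2).
long_toy = pd.wide_to_long(wide.reset_index(), stubnames=["Seq_", "Frq_"],
                           i="index", j="j", suffix=r"\d+")
print(long_toy)
```

Two input rows with two column pairs give four rows, indexed by (index, j), with the columns Time, Seq_ and Frq_.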
The next step is to cast the frequencies to float so that the maximum values can be found:
long_df['Frq_'] = pd.to_numeric(long_df.Frq_.str.replace(',', '.')
                                .replace('-', float('nan')))
print(long_df)
Time Seq_ Frq_
index j
0 1 12:43:04 30,668 NaN
1 1 12:46:05 30,699 NaN
2 1 12:46:17 30,700 4.20
3 1 12:46:18 30,700 3.06
4 1 12:46:18 30,700 3.06
5 1 12:46:19 30,700 3.06
6 1 12:46:20 30,700 3.06
7 1 12:46:37 30,698 NaN
...
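A possible variation on this conversion, shown as a sketch rather than as the answer's method: pd.to_numeric accepts errors="coerce", which turns anything unparseable (here the '-' placeholder) into NaN, so the separate replace of '-' is not needed. The series frq below is a made-up sample, and the comma is still treated as a decimal mark, as in the answer.

```python
import pandas as pd

# Made-up sample of raw frequency strings.
frq = pd.Series(["4,200", "-", "3,060"])

# Swap the comma for a dot, then let to_numeric coerce '-' to NaN.
as_float = pd.to_numeric(frq.str.replace(",", ".", regex=False),
                         errors="coerce")
print(as_float)
```

Note that coercion silently hides any other malformed value as NaN too, so the explicit replace used in the answer is the stricter choice.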
Then we can group by Seq_ and find the index of the highest frequency in each group. One could also think of using max, but this would drop the Time column.
long_df.reset_index(drop=True, inplace=True)
ix = long_df.groupby('Seq_').Frq_.idxmax()
And finally, select the rows using those indices:
print(long_df.loc[ix[ix.notna()].values.astype(int)])
Time Seq_ Frq_
34 12:43:04 30,671 4.62
16 12:49:29 30,690 4.16
42 12:46:38 30,700 4.60
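The difference between max and idxmax mentioned earlier can be seen on a small hypothetical frame (the data and the names toy and best are illustrative only). Selecting Frq_ and calling max returns just the per-group maxima, while idxmax returns row labels that can be passed to .loc to recover the full rows, Time included.

```python
import pandas as pd

# Hypothetical long-format frame without missing values.
toy = pd.DataFrame({
    "Time": ["12:00:00", "12:00:01", "12:00:02"],
    "Seq_": ["30,700", "30,700", "30,690"],
    "Frq_": [4.2, 4.6, 4.16],
})

# max(): one value per Seq_, no Time information.
print(toy.groupby("Seq_")["Frq_"].max())

# idxmax(): label of the row holding each maximum, usable with .loc.
best = toy.loc[toy.groupby("Seq_")["Frq_"].idxmax()]
print(best)
```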