Pandas - non overlapping group members
Problem description
I have the following dataframe:
id start end score
C1 2 592 157
C1 179 592 87
C1 113 553 82
C2 152 219 350
C2 13 70 319
C2 13 70 188
C2 15 70 156
C2 87 139 130
C2 92 140 102
C3 18 38 348
C3 20 35 320
C3 31 57 310
C4 347 51 514
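For anyone who wants to reproduce this, the table can be loaded into pandas from whitespace-separated text. This is a minimal sketch, assuming the columns are exactly as shown above; the names `RAW` and `df` are chosen here for illustration only.

```python
import io

import pandas as pd

# The table from the question, copied as whitespace-separated text.
RAW = """id start end score
C1 2 592 157
C1 179 592 87
C1 113 553 82
C2 152 219 350
C2 13 70 319
C2 13 70 188
C2 15 70 156
C2 87 139 130
C2 92 140 102
C3 18 38 348
C3 20 35 320
C3 31 57 310
C4 347 51 514
"""

# sep=r"\s+" splits on any run of whitespace; the first line becomes the header.
df = pd.read_csv(io.StringIO(RAW), sep=r"\s+")
print(df)
```

Note that the C4 row has a start greater than its end; the sketch loads it unchanged.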
The data is ordered by id and then by score (descending).
id represents a sequence of DNA.
start and end represent positions within id. I would like to keep the non-overlapping slices and, among overlapping slices, keep only the highest-scoring one:
id start end score
C1 2 592 157
C2 152 219 350
C2 13 70 319
C2 87 139 130
C3 18 38 348
C4 347 51 514
Any ideas?
Thanks
Recommended answer
This is shorter and meets all requirements. You need:
- a way to check for overlap
- a way to group the data by id
- a way to keep the best row from each group after checking for overlap.
This does all of those, cheating by using logic and groupby:
# from Ned Batchelder
# http://nedbatchelder.com/blog/201310/range_overlap_in_two_compares.html
def overlap(start1, end1, start2, end2):
    """
    Does the range (start1, end1) overlap with (start2, end2)?
    """
    return end1 >= start2 and end2 >= start1

def compare_rows(group):
    winners = []
    skip = []
    # A group with a single row has nothing to compare against.
    if len(group) == 1:
        return group[['start', 'end', 'score']]
    for i in group.index:
        if i in skip:
            continue
        for j in group.index:
            last = j == group.index[-1]
            istart = group.loc[i, 'start']
            iend = group.loc[i, 'end']
            jstart = group.loc[j, 'start']
            jend = group.loc[j, 'end']
            if overlap(istart, iend, jstart, jend):
                # Of the two overlapping rows, keep the higher-scoring one.
                winner = group.loc[[i, j], 'score'].idxmax()
                if winner == j:
                    winners.append(winner)
                    skip.append(i)
                    break
            if last:
                winners.append(i)
    # The same winner can be appended several times, so drop the repeats.
    return group.loc[winners, ['start', 'end', 'score']].drop_duplicates()

# df is the dataframe from the question.
grouped = df.groupby('id')
print(grouped.apply(compare_rows))
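One case worth checking is a chain of overlaps: A overlaps B, B overlaps C, but A and C do not overlap. As the loop above reads, C is replaced by B (the first higher-scoring row it overlaps) even though B was itself beaten by A, so the result would contain both A and B. If the intent is that no two kept rows overlap, a greedy pass from the highest score down is one alternative. The following is only a sketch under that assumption; `keep_non_overlapping`, `select` and the `chain` data are hypothetical names and values introduced here, not part of the answer.

```python
import pandas as pd


def overlap(start1, end1, start2, end2):
    """Closed-interval overlap test, same comparison as in the answer."""
    return end1 >= start2 and end2 >= start1


def keep_non_overlapping(group):
    """Hypothetical helper: visit rows from highest score down and keep a row
    only if it overlaps none of the rows already kept."""
    kept_ranges = []  # (start, end) of the rows accepted so far
    kept_labels = []  # index labels of those rows
    ordered = group.sort_values("score", ascending=False, kind="stable")
    for row in ordered.itertuples():
        if not any(overlap(row.start, row.end, s, e) for s, e in kept_ranges):
            kept_ranges.append((row.start, row.end))
            kept_labels.append(row.Index)
    return group.loc[kept_labels]


def select(frame):
    """Apply the greedy pass to each id separately and stitch the parts together."""
    parts = [keep_non_overlapping(g) for _, g in frame.groupby("id", sort=False)]
    return pd.concat(parts)


columns = ["id", "start", "end", "score"]

# The data from the question.
df = pd.DataFrame(
    [
        ("C1", 2, 592, 157), ("C1", 179, 592, 87), ("C1", 113, 553, 82),
        ("C2", 152, 219, 350), ("C2", 13, 70, 319), ("C2", 13, 70, 188),
        ("C2", 15, 70, 156), ("C2", 87, 139, 130), ("C2", 92, 140, 102),
        ("C3", 18, 38, 348), ("C3", 20, 35, 320), ("C3", 31, 57, 310),
        ("C4", 347, 51, 514),
    ],
    columns=columns,
)
result = select(df)
print(result)  # keeps the rows with index labels 0, 3, 4, 7, 9, 12

# Made-up chain: rows 0 and 1 overlap, rows 1 and 2 overlap, rows 0 and 2 do not.
chain = pd.DataFrame(
    [("X", 1, 10, 100), ("X", 9, 20, 90), ("X", 19, 30, 80)],
    columns=columns,
)
chain_result = select(chain)
print(chain_result)  # keeps the rows with index labels 0 and 2
```

On the question's data this keeps the same six rows as the expected output. It compares every candidate against every kept row, which is fine for small groups but quadratic in the worst case.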