Pandas - non overlapping group members


Question

I have the following dataframe:

id  start   end     score
C1  2       592     157
C1  179     592     87
C1  113     553     82
C2  152     219     350
C2  13      70      319
C2  13      70      188
C2  15      70      156
C2  87      139     130
C2  92      140     102
C3  18      38      348
C3  20      35      320
C3  31      57      310
C4  347     51      514

The data is ordered by id and, within each id, by score in descending order.
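
For anyone who wants to reproduce the problem, the table above can be loaded roughly as follows. This is only a sketch: the variable names `raw` and `ordered` are illustrative and not part of the original question, and the final check simply confirms the ordering described above.

```python
import io

import pandas as pd

# The question's table, pasted as whitespace-separated text.
raw = """id start end score
C1 2 592 157
C1 179 592 87
C1 113 553 82
C2 152 219 350
C2 13 70 319
C2 13 70 188
C2 15 70 156
C2 87 139 130
C2 92 140 102
C3 18 38 348
C3 20 35 320
C3 31 57 310
C4 347 51 514
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Re-sort by id (ascending) and score (descending); if the data is already
# in that order, the row labels come back unchanged.
ordered = df.sort_values(["id", "score"], ascending=[True, False])
print(ordered.index.tolist() == df.index.tolist())
```

Note that the C4 row has a start greater than its end; it is kept here exactly as given in the question.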

Each id represents a DNA sequence.

start and end are positions within that sequence. I would like to keep the non-overlapping slices and, among slices that overlap, only the highest-scoring one:

id  start   end     score
C1  2       592     157
C2  152     219     350
C2  13      70      319
C2  87      139     130
C3  18      38      348
C4  347     51      514

Any ideas?

Thanks

Recommended answer

This is shorter and meets all requirements. You need:

  1. A way to check for overlap
  2. A way to group the data by id
  3. A way to take the best row from each group after checking for overlap

The code below does all three, cheating a little by combining plain logic with groupby:

# from Ned Batchelder
# http://nedbatchelder.com/blog/201310/range_overlap_in_two_compares.html
def overlap(start1, end1, start2, end2):
    """
    Does the range (start1, end1) overlap with (start2, end2)?
    """
    return end1 >= start2 and end2 >= start1

def compare_rows(group):
    winners = []
    skip = []
    if len(group) == 1:
        return group[['start', 'end', 'score']]
    for i in group.index:
        if i in skip:
            continue
        for j in group.index:
            last = j == group.index[-1]
            istart = group.loc[i, 'start']
            iend = group.loc[i, 'end']
            jstart = group.loc[j, 'start']
            jend = group.loc[j, 'end']
            if overlap(istart, iend, jstart, jend):
                winner = group.loc[[i, j], 'score'].idxmax()
                if winner == j:
                    winners.append(winner)
                    skip.append(i)
                    break
            if last:
                winners.append(i)
    return group.loc[winners, ['start', 'end', 'score']].drop_duplicates()

# df is the dataframe from the question
grouped = df.groupby('id')
print(grouped.apply(compare_rows))
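
The nested loop above compares each row against every other row in its group, including rows that have already lost to a better one, so a row may end up being decided by a comparison with a discarded neighbour. A more explicit variant is a greedy pass: walk each group from highest to lowest score and keep a row only if it overlaps none of the rows already kept. The sketch below is an alternative, not the answer's own method; the names `keep_non_overlapping`, `kept` and `kept_idx` are made up for illustration, and ranges are treated as closed, as in `overlap`.

```python
import pandas as pd


def overlap(start1, end1, start2, end2):
    """Does the closed range (start1, end1) overlap with (start2, end2)?"""
    return end1 >= start2 and end2 >= start1


def keep_non_overlapping(group):
    """Greedy pass over the rows of one id, highest score first."""
    kept = []      # (start, end) of the rows accepted so far
    kept_idx = []  # their index labels
    ranked = group.sort_values("score", ascending=False, kind="stable")
    for row in ranked.itertuples():
        # Accept the row only if it clashes with nothing accepted earlier.
        if not any(overlap(row.start, row.end, s, e) for s, e in kept):
            kept.append((row.start, row.end))
            kept_idx.append(row.Index)
    return group.loc[kept_idx]


# The dataframe from the question.
df = pd.DataFrame({
    "id": ["C1"] * 3 + ["C2"] * 6 + ["C3"] * 3 + ["C4"],
    "start": [2, 179, 113, 152, 13, 13, 15, 87, 92, 18, 20, 31, 347],
    "end": [592, 592, 553, 219, 70, 70, 70, 139, 140, 38, 35, 57, 51],
    "score": [157, 87, 82, 350, 319, 188, 156, 130, 102, 348, 320, 310, 514],
})

result = pd.concat([keep_non_overlapping(g) for _, g in df.groupby("id")])
print(result.index.tolist())  # [0, 3, 4, 7, 9, 12]
```

On the question's data this keeps the same six rows as the desired output. The pass is quadratic in the size of each group, which is fine for small groups; for long sequences with many slices, an interval tree or a sort-based sweep would be worth considering.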
