Aggregate all dataframe row pair combinations using pandas

Question

I use pandas to group and aggregate across data frames, but I now want to aggregate specific pairs of rows (every n-choose-2 combination). Here is the example data; I would like to look at all pairs of the genes listed in mygenes:

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'case1'   : [0,1,1,0,0],
                       'case2'   : [1,1,1,0,1],
                       'control1':[0,0,1,1,1],
                       'control2':[1,0,0,1,0] })
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0

The final result should look like this (summing the two rows of each pair, as with np.sum, is fine as the default):

                 case1    case2    control1    control2
'ABC1', 'ABC2'    1         2         0            1
'ABC1', 'ABC3'    1         2         1            1
'ABC1', 'ABC4'    0         1         1            2
'ABC2', 'ABC3'    2         2         1            0
'ABC2', 'ABC4'    1         1         1            1
'ABC3', 'ABC4'    1         1         2            1 

The gene pairs are easy to generate with itertools.combinations(mygenes, 2), but I can't figure out how to aggregate the specific rows that each pair selects. Can anyone advise? Thank you.

Recommended answer

I can't think of a clever vectorized way to do this, but unless performance is a real bottleneck I tend to use the simplest thing that makes sense. In this case, I would call set_index("Gene") and then use loc to pick out the two rows of each pair:

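>>> import pandas as pd
>>> from itertools import combinations
>>> # Index by gene name so rows can be selected by label.
>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes, 2))
>>> # For each pair, select its two rows and sum them column-wise.
>>> out = pd.DataFrame([df.loc[list(c)].sum() for c in cc], index=cc)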
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1
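
If performance does turn out to be a bottleneck, one possible vectorized variant is sketched below. It is not part of the original answer: the helper name pairwise_sum and the index level names gene_a and gene_b are made up for illustration, and it assumes the aggregation is a plain sum and that every gene appears exactly once in the Gene column. It uses numpy.triu_indices to get the row positions of every pair, which come out in the same order as itertools.combinations.

```python
import numpy as np
import pandas as pd

mygenes = ["ABC1", "ABC2", "ABC3", "ABC4"]
df = pd.DataFrame({
    "Gene": ["ABC1", "ABC2", "ABC3", "ABC4", "ABC5"],
    "case1": [0, 1, 1, 0, 0],
    "case2": [1, 1, 1, 0, 1],
    "control1": [0, 0, 1, 1, 1],
    "control2": [1, 0, 0, 1, 0],
})


def pairwise_sum(frame, genes, key="Gene"):
    """Sum the rows of every 2-combination of `genes` (hypothetical helper)."""
    # Keep only the requested genes, in the requested order.
    sub = frame.set_index(key).loc[list(genes)]
    # Positions (i, j) with i < j, i.e. the strict upper triangle.
    first, second = np.triu_indices(len(sub), k=1)
    values = sub.to_numpy()
    # Label each result row with the two genes it combines.
    index = pd.MultiIndex.from_arrays(
        [sub.index[first], sub.index[second]], names=["gene_a", "gene_b"]
    )
    # Fancy indexing builds both operands at once; the addition is element-wise.
    return pd.DataFrame(values[first] + values[second],
                        index=index, columns=sub.columns)


out = pairwise_sum(df, mygenes)
print(out)
```

The loop over loc is easier to adapt to other aggregation functions; this version only pays off when the number of pairs is large, since it replaces one label lookup per pair with two array indexing operations.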
