Pairwise matrix from a pandas dataframe

Question

I have a pandas dataframe that looks something like this:

             Al01   BBR60   CA07    NL219
AAEAMEVAT    MP      NaN     MP      MP 
AAFEDLRLL    NaN     NaN     NaN     NaN
AAGAAVKGV    NP      NaN     NP      NP 
ADRGLLRDI    NaN     NP      NaN     NaN 
AEIMKICST    PB1     NaN     NaN     PB1 
AFDERRAGK    NaN     NaN     NP      NP 
AFDERRAGK    NP      NaN     NaN     NaN

There are a thousand or so rows and half a dozen columns. Most cells are empty (NaN). I would like to know what the probability of text in each column is, given that a different column has text in it. For example, the little snippet here would produce something like this:

            Al01    BBR60   CA07    NL219
Al01        4       0       2       3
BBR60       0       1       0       0
CA07        2       0       3       3
NL219       3       0       3       4

That says that there are 4 hits in the Al01 column; of those 4 hits, none are hits in the BBR60 column, 2 are also hits in the CA07 column, and 3 are hits in the NL219 column. And so on.

I can step through each column and build a dict with the values, but that seems clumsy. Is there a simpler approach?

Recommended answer

It's just a matrix multiplication:

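import pandas as pd

# Read the whitespace-delimited table; the first column becomes the index.
df = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
# 1 where a cell holds text, 0 where it is NaN.
df2 = df.notna().astype(int)
# (columns x rows) times (rows x columns) gives pairwise co-occurrence counts.
print(df2.T.dot(df2))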

Output:

       Al01  BBR60  CA07  NL219
Al01      4      0     2      3
BBR60     0      1     0      0
CA07      2      0     3      3
NL219     3      0     3      4

[4 rows x 4 columns]
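
Since the question asks for probabilities rather than raw counts, here is a minimal self-contained sketch of one way to get there. It rebuilds the sample rows from the question inline instead of reading a file, and the names `hits`, `counts`, `totals` and `cond` are illustrative only, not part of the original answer. It assumes "probability" means the share of hits in one column that are also hits in another, that is, each row of the count matrix divided by its diagonal entry.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; NaN marks an empty cell.
df = pd.DataFrame(
    {
        "Al01":  ["MP", np.nan, "NP", np.nan, "PB1", np.nan, "NP"],
        "BBR60": [np.nan, np.nan, np.nan, "NP", np.nan, np.nan, np.nan],
        "CA07":  ["MP", np.nan, "NP", np.nan, np.nan, "NP", np.nan],
        "NL219": ["MP", np.nan, "NP", np.nan, "PB1", "NP", np.nan],
    },
    index=["AAEAMEVAT", "AAFEDLRLL", "AAGAAVKGV", "ADRGLLRDI",
           "AEIMKICST", "AFDERRAGK", "AFDERRAGK"],
)

# Indicator matrix: 1 where a cell holds text, 0 where it is NaN.
hits = df.notna().astype(int)

# Pairwise co-occurrence counts. The product is taken on the raw NumPy
# array, so the row labels (including the repeated one) play no role.
m = hits.to_numpy()
counts = pd.DataFrame(m.T @ m, index=df.columns, columns=df.columns)

# Assumed definition: P(column j has text | column i has text)
# = counts[i, j] / counts[i, i]; the diagonal holds per-column hit totals.
totals = pd.Series(np.diag(counts.to_numpy()), index=counts.index)
cond = counts.div(totals, axis=0)

print(counts)
print(cond)
```

Row `Al01` of `cond` then reads: of the rows with text in `Al01`, this fraction also has text in each other column. A column with no hits at all would give 0/0 and therefore NaN in its row.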
