Pairwise matrix from a pandas dataframe
Question
I have a pandas dataframe that looks something like this:
               Al01  BBR60  CA07  NL219
    AAEAMEVAT    MP    NaN    MP     MP
    AAFEDLRLL   NaN    NaN   NaN    NaN
    AAGAAVKGV    NP    NaN    NP     NP
    ADRGLLRDI   NaN     NP   NaN    NaN
    AEIMKICST   PB1    NaN   NaN    PB1
    AFDERRAGK   NaN    NaN    NP     NP
    AFDERRAGK    NP    NaN   NaN    NaN
There are a thousand or so rows and half a dozen columns. Most cells are empty (NaN). I would like to know what the probability of text in each column is, given that a different column has text in it. For example, the little snippet here would produce something like this:
           Al01  BBR60  CA07  NL219
    Al01      4      0     2      3
    BBR60     0      1     0      0
    CA07      2      0     3      3
    NL219     3      0     3      4
That says that there are 4 hits in the Al01 column; of those 4 hits, none are hits in the BBR60 column, 2 are also hits in the CA07 column, and 3 are hits in the NL219 column. And so on.
I can step through each column and build a dict with the values, but that seems clumsy. Is there a simpler approach?
Recommended answer
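It's just a matrix multiplication: turn the frame into a 0/1 indicator matrix (1 where a cell has text, 0 where it is NaN) and multiply its transpose by itself.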
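    import pandas as pd

    # Read the whitespace-delimited table; the first column becomes the index
    df = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
    # 1 where a cell holds text, 0 where it is NaN
    df2 = df.notna().astype(int)
    # Entry (i, j) counts the rows in which columns i and j both have text
    print(df2.T.dot(df2))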
Output:
           Al01  BBR60  CA07  NL219
    Al01      4      0     2      3
    BBR60     0      1     0      0
    CA07      2      0     3      3
    NL219     3      0     3      4
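The question asks for probabilities, while the matrix above holds raw counts. One way to get from one to the other, sketched here as an assumption and not as part of the original answer, is to divide each row of the count matrix by its diagonal entry, so that entry (i, j) becomes the share of hits in column i that are also hits in column j. The sample frame is rebuilt in memory from the rows shown in the question, and the names `hits`, `counts` and `cond_prob` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; the index label AFDERRAGK repeats.
index = ["AAEAMEVAT", "AAFEDLRLL", "AAGAAVKGV", "ADRGLLRDI",
         "AEIMKICST", "AFDERRAGK", "AFDERRAGK"]
df = pd.DataFrame(
    {
        "Al01":  ["MP", np.nan, "NP", np.nan, "PB1", np.nan, "NP"],
        "BBR60": [np.nan, np.nan, np.nan, "NP", np.nan, np.nan, np.nan],
        "CA07":  ["MP", np.nan, "NP", np.nan, np.nan, "NP", np.nan],
        "NL219": ["MP", np.nan, "NP", np.nan, "PB1", "NP", np.nan],
    },
    index=index,
)

# Indicator matrix: 1 where a cell has text, 0 where it is NaN.
hits = df.notna().astype(int)

# Co-occurrence counts; the diagonal holds the per-column hit totals.
counts = hits.T.dot(hits)

# Divide row i by counts[i, i], giving P(column j has text | column i has text).
# A column with no hits at all would yield NaN in its row (0 / 0).
totals = pd.Series(np.diag(counts.to_numpy()), index=counts.index)
cond_prob = counts.div(totals, axis=0)

print(counts)
print(cond_prob.round(2))
```

Note that `cond_prob` is not symmetric: the row label is the column being conditioned on, so `cond_prob.loc["Al01", "CA07"]` and `cond_prob.loc["CA07", "Al01"]` generally differ.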