Co-occurrence matrix from a pandas DataFrame

Question

I have a pandas dataframe, and for every pair of unique entries in it I need to count the rows in which both entries occur.

  • Co-occurrence Matrix from list of words in Python: Similar question to mine, but does not start with a dataframe. Most answers use iterations. I hope a better solution exists in Pandas.
  • Constructing a co-occurrence matrix in python pandas: This already starts with a dataframe where there are only 0 and 1 in the body (I guess representing the actual values?) but not the actual values.
  • Convert Two column data frame to occurrence matrix in pandas: This post assumes there are only two columns, which is rather restrictive for the case discussed here.

import pandas as pd
import numpy as np

The dataframe:

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                   index=[0, 1, 2, 3])

i.e.:

+----+-----+-----+-----+-----+
|    | a   | b   | c   | d   |
|----+-----+-----+-----+-----|
|  0 | A   | B   | C   | B   |
|  1 | A   | C   | A   | D   |
|  2 | B   | B   | C   | B   |
|  3 | B   | B   | A   | A   |
+----+-----+-----+-----+-----+

(Printed using this.)

I have tried to use the code from an answer, substituting these variables:

document = [list(each) for each in df.values]
names = list(np.unique(df.values))

It gives the wrong result:

    A   B  C  D
A   4   6  3  2
B   6  10  5  0
C   3   5  0  1
D   2   0  1  0

It is based on iteration, so I would hope for a better solution.

The expected output is:

+----+-----+-----+-----+-----+
|    |   A |   B |   C |   D |
|----+-----+-----+-----+-----|
| A  | nan |   2 |   2 |   1 |
| B  |   2 | nan |   2 |   0 |
| C  |   2 |   2 | nan |   1 |
| D  |   1 |   0 |   1 | nan |
+----+-----+-----+-----+-----+

There are 2 rows where A and B both appear, so the value in the cell at row A, column B is 2. There are 2 rows where A and C both appear, so the value in the cell at row A, column C is 2.
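To make the counting rule concrete, here is a small brute-force check of the expected numbers. It is only a sketch for verifying the definition (it loops over rows, which is exactly what the question wants to avoid), and the name `pair_counts` is introduced here for illustration.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

# For each unordered pair of distinct values, count the rows containing both.
pair_counts = {}
for row in df.itertuples(index=False):
    present = sorted(set(row))  # a value counts once per row, however often it repeats
    for pair in combinations(present, 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1

print(pair_counts)
```

Pairs that never share a row, such as B and D, do not appear in the dictionary at all, which corresponds to the 0 cells in the table.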

How can I get this row-wise co-occurrence matrix easily in pandas? It would be great if I didn't have to loop through the values.

(pandas.Categorical might be of some use, but I haven't managed to make it work yet.)
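One way pandas.Categorical could be put to use is sketched below: encode every cell as an integer code, build a 0/1 row-by-value indicator matrix with NumPy indexing, and multiply it by its own transpose. This is an illustration rather than an established recipe; the names `indicator`, `row_ids` and `result` are made up here, and the sketch assumes the frame has no missing values (a missing value would get code -1 and be written into the last column).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

# Encode every cell as an integer code; categories are the sorted unique values.
cat = pd.Categorical(df.to_numpy().ravel())
n_rows, n_cols = df.shape

# Row position of each flattened cell (ravel() is row-major, so rows stay contiguous).
row_ids = np.repeat(np.arange(n_rows), n_cols)

# indicator[i, j] is 1 if category j appears anywhere in row i, else 0.
indicator = np.zeros((n_rows, len(cat.categories)), dtype=int)
indicator[row_ids, cat.codes] = 1

# Entry (j, k) of the product counts the rows containing both category j and category k.
counts = indicator.T @ indicator
result = pd.DataFrame(counts, index=cat.categories, columns=cat.categories)
print(result)
```

The diagonal of `result` holds the number of rows each value appears in; it can be blanked out afterwards if only the off-diagonal counts are wanted.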

Recommended answer

We can use stack, then get_dummies, then dot, and finally set the diagonal to NaN:

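# One 0/1 indicator column per unique value, collapsed back to one row per original row
# (groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0)
s = df.stack().str.get_dummies().groupby(level=0).sum().ne(0).astype(int)
# s.T.dot(s) counts, for each pair of values, the rows that contain both
s = s.T.dot(s).astype(float)
# a value's co-occurrence with itself is not wanted, so blank the diagonal
np.fill_diagonal(s.values, np.nan)
s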
Out[33]: 
     A    B    C    D
A  NaN  2.0  2.0  1.0
B  2.0  NaN  2.0  0.0
C  2.0  2.0  NaN  1.0
D  1.0  0.0  1.0  NaN
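The same idea can be wrapped in a small function. The sketch below is a variation on the answer, not the answer's own code: the helper name `cooccurrence` is hypothetical, `groupby(level=0).max()` is used to collapse the dummies to one 0/1 row per original row, and the diagonal is blanked with `DataFrame.mask` instead of writing into `.values`, on the assumption that this is safer when pandas copy-on-write makes the underlying array read-only.

```python
import numpy as np
import pandas as pd


def cooccurrence(frame):
    """Hypothetical helper: count rows in which each pair of values co-occurs."""
    # Long format (row, column) -> value, then one 0/1 dummy column per value.
    dummies = frame.stack().str.get_dummies()
    # Collapse to one row per original row: 1 if the value appears in that row.
    indicator = dummies.groupby(level=0).max()
    # Entry (x, y) of the product is the number of rows containing both x and y.
    counts = indicator.T.dot(indicator).astype(float)
    # Replace the diagonal (a value paired with itself) with NaN.
    return counts.mask(np.eye(len(counts), dtype=bool))


df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

result = cooccurrence(df)
print(result)
```

Note that `str.get_dummies` splits on `|` by default, so this sketch assumes the cell values do not contain that character.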
