Co-occurrence matrix from a pandas dataframe
Question
I have a pandas dataframe, and for each pair of unique entries in it I need to count the rows in which both entries occur.
- Co-occurrence Matrix from list of words in Python: Similar question to mine, but does not start with a dataframe. Most answers use iterations. I hope a better solution exists in Pandas.
- Constructing a co-occurrence matrix in python pandas: This already starts with a dataframe where there are only 0 and 1 in the body (I guess representing the actual values?) but not the actual values.
- Convert Two column data frame to occurrence matrix in pandas: This post assumes there are only two columns, which is too restrictive for the case discussed here.
import pandas as pd
import numpy as np
The dataframe:
df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])
That is:
+----+-----+-----+-----+-----+
| | a | b | c | d |
|----+-----+-----+-----+-----|
| 0 | A | B | C | B |
| 1 | A | C | A | D |
| 2 | B | B | C | B |
| 3 | B | B | A | A |
+----+-----+-----+-----+-----+
(Printed using this.)
I tried the code from an answer, substituting these variables:
document = [list(each) for each in df.values]
names = list(np.unique(df.values))
It gives the wrong result:
   A   B  C  D
A  4   6  3  2
B  6  10  5  0
C  3   5  0  1
D  2   0  1  0
It is also based on iteration, so I would hope for a better solution.
Expected output:
+----+-----+-----+-----+-----+
| | A | B | C | D |
|----+-----+-----+-----+-----|
| A | nan | 2 | 2 | 1 |
| B | 2 | nan | 2 | 0 |
| C | 2 | 2 | nan | 1 |
| D | 1 | 0 | 1 | nan |
+----+-----+-----+-----+-----+
There are 2 rows where A & B both appear (rows 0 and 3), so the value in the cell at row A, column B is 2.
There are 2 rows where A & C both appear (rows 0 and 1), so the value in the cell at row A, column C is 2.
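The cell definition above can be checked with a small loop-based sketch. This is only a reference implementation for verifying results, not the vectorized solution being asked for; the names labels and expected are made up for this illustration.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

# Unique values across the whole frame, in sorted order
labels = sorted(set(df.to_numpy().ravel()))

# Start from zeros; the diagonal is set to NaN at the end
expected = pd.DataFrame(0.0, index=labels, columns=labels)

for row in df.itertuples(index=False):
    present = sorted(set(row))  # a value counts once per row, however often it repeats
    for x, y in combinations(present, 2):
        expected.loc[x, y] += 1
        expected.loc[y, x] += 1

# A value's co-occurrence with itself is left undefined
for label in labels:
    expected.loc[label, label] = float('nan')

print(expected)
```

Taking set(row) before pairing is what keeps a row such as B, B, C, B from counting the pair B & C more than once.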
How can I get this row-wise co-occurrence matrix easily in Pandas? It would be great if I didn't have to loop through the values.
(pandas.Categorical might be of some use, but I haven't managed to make it work yet.)
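For what it is worth, one way pandas.Categorical could come into play is sketched below. This is an assumption about how it might be used, not something taken from the linked posts: the category codes serve as column positions in a 0/1 indicator matrix, and the counts come from a matrix product. The names cat, codes, indicator and cooc are illustrative, and the sketch assumes the frame has no missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

# Encode every cell as an integer code; categories are the sorted unique values.
# Assumption: no missing values (a missing value would get code -1).
cat = pd.Categorical(df.to_numpy().ravel())
codes = cat.codes.reshape(df.shape)

# indicator[i, k] = 1 if category k occurs anywhere in row i
indicator = np.zeros((len(df), len(cat.categories)), dtype=int)
indicator[np.arange(len(df))[:, None], codes] = 1

# (indicator.T @ indicator)[j, k] = number of rows containing both j and k
counts = (indicator.T @ indicator).astype(float)
np.fill_diagonal(counts, np.nan)

cooc = pd.DataFrame(counts, index=cat.categories, columns=cat.categories)
print(cooc)
```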
Answer
We can do stack, then str.get_dummies and dot, and finally set the diagonal through the underlying values:
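# One 0/1 indicator line per original row: 1 if the value occurs anywhere in that row.
# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0.
s = df.stack().str.get_dummies().groupby(level=0).sum().ne(0).astype(int)
# s.T.dot(s) counts, for each pair of values, the rows that contain both
s = s.T.dot(s).astype(float)
# The diagonal holds each value's own row count; blank it out
np.fill_diagonal(s.values, np.nan)
s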
Out[33]:
     A    B    C    D
A  NaN  2.0  2.0  1.0
B  2.0  NaN  2.0  0.0
C  2.0  2.0  NaN  1.0
D  1.0  0.0  1.0  NaN
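The same steps can be wrapped in a function, as in the sketch below. The name row_cooccurrence is hypothetical. Two details differ from the snippet above and are choices made for this sketch: the grouping is written as groupby(level=0).sum(), and the diagonal is filled on a separate NumPy copy rather than through s.values, since writing into a frame's underlying array in place may not be allowed when pandas Copy-on-Write is active.

```python
import numpy as np
import pandas as pd


def row_cooccurrence(df):
    """Hypothetical helper: count the rows in which each pair of values co-occurs."""
    # Long format: one entry per cell, indexed by (row label, column label)
    stacked = df.stack()
    # 0/1 indicators per cell, collapsed to one line per original row
    indicators = (stacked.str.get_dummies()
                         .groupby(level=0).sum()
                         .ne(0).astype(int))
    # Pairwise row counts as a writable float array
    counts = indicators.T.dot(indicators).to_numpy(dtype=float, copy=True)
    # A value's co-occurrence with itself is left undefined
    np.fill_diagonal(counts, np.nan)
    return pd.DataFrame(counts, index=indicators.columns,
                        columns=indicators.columns)


df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'b': ['B', 'C', 'B', 'B'],
                   'c': ['C', 'A', 'C', 'A'],
                   'd': ['B', 'D', 'B', 'A']},
                  index=[0, 1, 2, 3])

result = row_cooccurrence(df)
print(result)
```

The ne(0) step is what makes this a per-row presence count: without it, a row such as B, B, C, B would contribute 3 to the pair B & C instead of 1.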