Python Pandas Pairwise Frequency Table with many columns

Question

Beginner Pandas Question here:

How do I create a cross frequency count table for all columns? I want to use the output to make a seaborn heatmap plot showing the counts between each pair of columns.

I have a DataFrame (pulled down from HDFS with PySpark) with ~70 unique columns and about 600K rows.

Desired sample output:

    C1 C2 C3 C4 ...C70
C1  -  1  1  2
C2  1  -  0  2
C3  1  0  -  1
C4  2  2  1  -
...   
C70

Sample DataFrame:

import numpy as np
import pandas as pd

# Small sample; the real data has ~70 columns and ~600K rows
raw_data = {
    'C1': [0, 2, 5, 0, 3],
    'C2': [3, 0, 2, 0, 0],
    'C3': [0, 0, 0, 3, 3],
    'C4': [2, 1, 1, 4, 0],
}
df = pd.DataFrame(raw_data, columns=['C1', 'C2', 'C3', 'C4'])
print(df)

I've tried using crosstab, pivot, pivot_table from pandas and think that the solution is using crosstab, but I can't get it in the desired output format (sorry if there is something obvious I'm missing). Any help appreciated!

Recommended answer

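Clip positive values to 1 with clip(upper=1), and then compute the dot product of the transposed frame with itself. (Older pandas versions spelled this clip_upper(1); that method was removed in pandas 1.0.) Assuming the data contains no negative values, each entry of the result counts the rows in which both columns are nonzero:

i = df.clip(upper=1)
j = i.T.dot(i)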

j

    C1  C2  C3  C4
C1   3   1   1   2
C2   1   2   0   2
C3   1   0   2   1
C4   2   2   1   4
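
The result above carries each column's own nonzero count on the diagonal, whereas the desired output in the question shows "-" there. A minimal sketch of one way to blank the diagonal before plotting follows. The helper name pairwise_counts and the variable off_diagonal are illustrative, not part of the original answer, and the indicator is built with (frame > 0) instead of clip, which is a swapped-in variant that also treats negative values as "absent".

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'C1': [0, 2, 5, 0, 3],
    'C2': [3, 0, 2, 0, 0],
    'C3': [0, 0, 0, 3, 3],
    'C4': [2, 1, 1, 4, 0],
})


def pairwise_counts(frame):
    """Count, for every pair of columns, the rows where both are positive.

    Hypothetical helper: same dot-product idea as the answer, but using a
    boolean comparison to build the 0/1 indicator frame.
    """
    indicator = (frame > 0).astype(int)  # 1 where the value is positive, else 0
    return indicator.T.dot(indicator)    # (columns x columns) co-occurrence counts


counts = pairwise_counts(df)

# Replace the diagonal with NaN so a heatmap can leave those cells blank
diagonal = np.eye(len(counts), dtype=bool)
off_diagonal = counts.mask(diagonal)

print(counts)
print(off_diagonal)
```

Passing off_diagonal to seaborn.heatmap (for example with annot=True) should then draw the pairwise counts, with the NaN diagonal cells left unpainted. With about 70 columns the result is only a 70 x 70 matrix, so the dot product should stay manageable even at 600K rows.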
