Drawing a correlation graph in matplotlib


Question

Suppose I have a data set of discrete vectors with n=2:

DATA = [
    ('a', 4),
    ('b', 5),
    ('c', 5),
    ('d', 4),
    ('e', 2),
    ('f', 5),
]

How can I plot that data set with matplotlib so as to visualize any correlation between the two variables?

Any simple code examples would be great.

Solution

Joe Kington has the correct answer, but your DATA is probably more complicated than what is shown. It might have multiple values at 'a'. The way Joe builds the x axis values is quick, but it only works for a list of unique values. There may be a faster way to do this, but this is how I accomplished it:

import matplotlib.pyplot as plt

def assignIDs(values):
    '''Take a list of strings, and for each unique value assign a number.
    Returns a map for "unique-val"->id.
    '''
    sortedList = sorted(values)

    # Order-preserving de-duplication, taken from
    # http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-python-whilst-preserving-order/480227#480227
    seen = set()
    seen_add = seen.add
    # seen_add(x) returns None, so "not seen_add(x)" is always True and
    # only serves to record x as seen.
    uniqueList = [x for x in sortedList if x not in seen and not seen_add(x)]

    return dict(zip(uniqueList, range(len(uniqueList))))

def plotData(inData,color):
    x,y = zip(*inData)

    xMap = assignIDs(x)
    xAsInts = [xMap[i] for i in x]


    plt.scatter(xAsInts,y,color=color)
    # Convert the dict views to lists so this also works on Python 3.
    plt.xticks(list(xMap.values()),list(xMap.keys()))


DATA = [
    ('a', 4),
    ('b', 5),
    ('c', 5),
    ('d', 4),
    ('e', 2),
    ('f', 5),
]


DATA2 = [
    ('a', 3),
    ('b', 4),
    ('c', 4),
    ('d', 3),
    ('e', 1),
    ('f', 4),
    ('a', 5),
    ('b', 7),
    ('c', 7),
    ('d', 6),
    ('e', 4),
    ('f', 7),
]

plotData(DATA,'blue')
plotData(DATA2,'red')

plt.gcf().savefig("correlation.png")

My DATA2 set has two values for every x axis value. It is plotted in red in the saved figure.
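On the "there may be a faster way" remark: one possible alternative, sketched here as an assumption rather than as part of the original answer, is numpy.unique with return_inverse=True. It returns the sorted unique labels together with an integer index for every input label, which is the same mapping assignIDs builds. The names unique_labels and x_as_ints are illustrative only.

```python
import numpy as np

DATA2 = [
    ('a', 3), ('b', 4), ('c', 4), ('d', 3), ('e', 1), ('f', 4),
    ('a', 5), ('b', 7), ('c', 7), ('d', 6), ('e', 4), ('f', 7),
]

labels, values = zip(*DATA2)

# np.unique returns the sorted unique labels; with return_inverse=True it
# also returns, for every input label, its index into that sorted array.
unique_labels, x_as_ints = np.unique(labels, return_inverse=True)

print(unique_labels)  # the tick labels, one per distinct x value
print(x_as_ints)      # the integer x position of every data point
```

Here x_as_ints can be passed to plt.scatter in place of xAsInts, and unique_labels can be used as the tick labels.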

EDIT

The question you asked is very broad. I searched 'correlation', and Wikipedia had a good discussion on Pearson's product-moment coefficient, which measures how strongly the data follow a linear relationship. Keep in mind that this value is only a guide and does not tell you whether a linear fit is a reasonable assumption; see the notes on correlation and linearity on that Wikipedia page. Here is an updated plotData function, which uses numpy.linalg.lstsq to do the linear regression and numpy.corrcoef to calculate Pearson's r:

import matplotlib.pyplot as plt
import numpy as np

def plotData(inData,color):
    x,y = zip(*inData)

    xMap = assignIDs(x)
    xAsInts = np.array([xMap[i] for i in x])

    pearR = np.corrcoef(xAsInts,y)[1,0]
    # least squares from:
    # http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html
    A = np.vstack([xAsInts,np.ones(len(xAsInts))]).T
    # rcond=None selects NumPy's current default cutoff and avoids a FutureWarning
    m,c = np.linalg.lstsq(A,np.array(y),rcond=None)[0]

    plt.scatter(xAsInts,y,label='Data '+color,color=color)
    plt.plot(xAsInts,xAsInts*m+c,color=color,
             label="Fit %6s, r = %6.2e"%(color,pearR))
    # Convert the dict views to lists so this also works on Python 3.
    plt.xticks(list(xMap.values()),list(xMap.keys()))
    plt.legend(loc=3)

The new figure is:

Flattening each direction and looking at the individual distributions might also be useful, and there are examples of doing this in matplotlib.
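As a minimal sketch of what "flattening each direction" could look like, the following draws the distribution of x and the distribution of y side by side for DATA2. The layout, the bin choice, and the file name marginals.png are assumptions made for illustration; they are not taken from the examples the answer refers to.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; the figure is only saved
import matplotlib.pyplot as plt
import numpy as np

DATA2 = [
    ('a', 3), ('b', 4), ('c', 4), ('d', 3), ('e', 1), ('f', 4),
    ('a', 5), ('b', 7), ('c', 7), ('d', 6), ('e', 4), ('f', 7),
]

labels, y = zip(*DATA2)
unique_labels, x = np.unique(labels, return_inverse=True)
y = np.array(y)

fig, (ax_x, ax_y) = plt.subplots(1, 2, figsize=(8, 3))

# Distribution of x: how many points fall on each label.
x_counts = np.bincount(x, minlength=len(unique_labels))
positions = np.arange(len(unique_labels))
ax_x.bar(positions, x_counts)
ax_x.set_xticks(positions)
ax_x.set_xticklabels(unique_labels)
ax_x.set_title("Distribution of x")

# Distribution of y: one bin centred on each integer value.
y_bins = np.arange(y.min(), y.max() + 2) - 0.5
y_counts, _, _ = ax_y.hist(y, bins=y_bins)
ax_y.set_title("Distribution of y")

fig.tight_layout()
fig.savefig("marginals.png")
```

A bar chart is used for x because the labels are categories, while y is numeric and can be binned with a histogram.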

If a linear approximation is useful, which you can determine qualitatively by just looking at the fit, you might want to subtract out this trend before flattening the y direction. This would help show whether you have a Gaussian random distribution about a linear trend.
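The detrending step could be sketched as follows. This uses numpy.polyfit for the straight-line fit instead of the numpy.linalg.lstsq call shown earlier, purely for brevity; the bin count and the file name residuals.png are arbitrary assumptions. With only twelve points the histogram is suggestive at best.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; the figure is only saved
import matplotlib.pyplot as plt
import numpy as np

DATA2 = [
    ('a', 3), ('b', 4), ('c', 4), ('d', 3), ('e', 1), ('f', 4),
    ('a', 5), ('b', 7), ('c', 7), ('d', 6), ('e', 4), ('f', 7),
]

labels, y = zip(*DATA2)
_, x = np.unique(labels, return_inverse=True)
y = np.array(y, dtype=float)

# Degree-1 least-squares fit: y is approximated by m*x + c.
m, c = np.polyfit(x, y, 1)

# Residuals are what is left of y once the linear trend is removed.
residuals = y - (m * x + c)

fig, ax = plt.subplots()
ax.hist(residuals, bins=5)
ax.set_xlabel("y minus linear fit")
ax.set_ylabel("count")
fig.savefig("residuals.png")
```

If the residuals scatter roughly symmetrically around zero, that is consistent with random noise about a linear trend; a clearly skewed or multi-peaked histogram suggests the linear model is missing something.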
