Matplotlib dot plot with two categorical variables

Question

I would like to produce a specific type of visualization: a rather simple dot plot, but with a twist. Both axes are categorical variables (i.e. ordinal or non-numerical values), and this complicates matters rather than simplifying them.

To illustrate this question, I will use a small example dataset, a modified version of seaborn.load_dataset("tips"), defined as follows:

import pandas
from io import StringIO
df = """total_bill |  tip  |    sex | smoker | day |   time | size
             16.99 | 1.01  |   Male |     No | Mon | Dinner |    2
             10.34 | 1.66  |   Male |     No | Sun | Dinner |    3
             21.01 | 3.50  |   Male |     No | Sun | Dinner |    3
             23.68 | 3.31  |   Male |     No | Sun | Dinner |    2
             24.59 | 3.61  | Female |     No | Sun | Dinner |    4
             25.29 | 4.71  | Female |     No | Mon | Lunch  |    4
              8.77 | 2.00  | Female |     No | Tue | Lunch  |    2
             26.88 | 3.12  |   Male |     No | Wed | Lunch  |    4
             15.04 | 3.96  |   Male |     No | Sat | Lunch  |    2
             14.78 | 3.23  |   Male |     No | Sun | Lunch  |    2"""
df = pandas.read_csv(StringIO(df.replace(' ','')), sep="|", header=0)

My first approach was to call seaborn like this:

import seaborn
axes = seaborn.pointplot(x="time", y="sex", data=df)

This fails with:

ValueError: Neither the `x` nor `y` variable appears to be numeric.

So do the equivalent seaborn.stripplot and seaborn.swarmplot calls. It does work, however, if one of the variables is categorical and the other is numerical. Indeed, seaborn.pointplot(x="total_bill", y="sex", data=df) works, but it is not what I want.

I also attempted a scatterplot like this:

axes = seaborn.scatterplot(x="time", y="sex", size="day", data=df,
                           x_jitter=True, y_jitter=True)

This produces a graph that contains no jitter at all: every dot overlaps, which makes it useless.

Do you know of any elegant approach or library that could solve my problem?

I started writing something myself, which I will include below, but this implementation is suboptimal and limited by the number of points that can overlap at the same spot (currently it fails if more than 4 points overlap).

# Modules #
import matplotlib.ticker
import pandas
import seaborn
from io import StringIO

################################################################################
def amount_to_offets(amount):
    """A function that takes an amount of overlapping points (e.g. 3)
    and returns a list of offsets (jittered) coordinates for each of the
    points.

    It follows the logic that two points are displayed side by side:

    2 ->  * *

    Three points are organized in a triangle

    3 ->   *
          * *

    Four points are sorted into a square, and so on.

    4 ->  * *
          * *
    """
    assert isinstance(amount, int)
    solutions = {
        1: [( 0.0,  0.0)],
        2: [(-0.5,  0.0), ( 0.5,  0.0)],
        3: [(-0.5, -0.5), ( 0.0,  0.5), ( 0.5, -0.5)],
        4: [(-0.5, -0.5), ( 0.5,  0.5), ( 0.5, -0.5), (-0.5,  0.5)],
    }
    return solutions[amount]

################################################################################
class JitterDotplot(object):

    def __init__(self, data, x_col='time', y_col='sex', z_col='tip'):
        self.data = data
        self.x_col = x_col
        self.y_col = y_col
        self.z_col = z_col

    def plot(self, **kwargs):
        # Load data #
        self.df = self.data.copy()

        # Assign numerical values to the categorical data #
        # So that ['Dinner', 'Lunch'] becomes [0, 1] etc. #
        self.x_values = self.df[self.x_col].unique()
        self.y_values = self.df[self.y_col].unique()
        self.x_mapping = dict(zip(self.x_values, range(len(self.x_values))))
        self.y_mapping = dict(zip(self.y_values, range(len(self.y_values))))
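        # Cast to float so that fractional offsets can be added below #
        self.df[self.x_col] = self.df[self.x_col].map(self.x_mapping).astype(float)
        self.df[self.y_col] = self.df[self.y_col].map(self.y_mapping).astype(float)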

        # Offset points that are overlapping in the same location #
        # So that (2.0, 3.0) becomes (2.05, 2.95) for instance #
        cols = [self.x_col, self.y_col]
        scaling_factor = 0.05
        for values, df_view in self.df.groupby(cols):
            offsets = amount_to_offets(len(df_view))
            offsets = pandas.DataFrame(offsets, index=df_view.index, columns=cols)
            offsets *= scaling_factor
            self.df.loc[offsets.index, cols] += offsets

        # Plot a standard scatter plot #
        g = seaborn.scatterplot(x=self.x_col, y=self.y_col, size=self.z_col, data=self.df, **kwargs)

        # Force integer ticks on the x and y axes #
        # (each axis needs its own locator instance) #
        g.xaxis.set_major_locator(matplotlib.ticker.MaxNLocator(integer=True))
        g.yaxis.set_major_locator(matplotlib.ticker.MaxNLocator(integer=True))
        g.grid(False)

        # Expand the axis limits for x and y #
        margin = 0.4
        xmin, xmax, ymin, ymax = g.get_xlim() + g.get_ylim()
        g.set_xlim(xmin-margin, xmax+margin)
        g.set_ylim(ymin-margin, ymax+margin)

        # Replace ticks with the original categorical names #
        g.set_xticks(list(self.x_mapping.values()))
        g.set_xticklabels(list(self.x_mapping.keys()))
        g.set_yticks(list(self.y_mapping.values()))
        g.set_yticklabels(list(self.y_mapping.keys()))

        # Return for display in notebooks for instance #
        return g

################################################################################
# Graph #
graph = JitterDotplot(data=df)
axes  = graph.plot()
axes.figure.savefig('jitter_dotplot.png')

Solution

You could first convert time and sex to the categorical type, then add a little random jitter to their integer codes:

import numpy as np
import pandas as pd
import seaborn as sns

df.sex = pd.Categorical(df.sex)
df.time = pd.Categorical(df.time)

axes = sns.scatterplot(x=df.time.cat.codes+np.random.uniform(-0.1,0.1, len(df)), 
                       y=df.sex.cat.codes+np.random.uniform(-0.1,0.1, len(df)),
                       size=df.tip)

Output:
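Because the scatter plot is drawn from cat.codes, its axes show integer codes rather than category names. The following is a minimal sketch of how the codes relate to the names, using a short stand-in Series rather than the full dataset (the variable names are illustrative only):

```python
import pandas as pd

# Stand-in for the "time" column; the values are illustrative.
time = pd.Series(["Dinner", "Dinner", "Lunch", "Lunch", "Dinner"], dtype="category")

codes = time.cat.codes                     # integer code of each row's category
labels = list(time.cat.categories)         # category names, in code order
tick_positions = list(range(len(labels)))  # code i corresponds to labels[i]

print(codes.tolist())          # [0, 0, 1, 1, 0]
print(tick_positions, labels)  # [0, 1] ['Dinner', 'Lunch']
```

The positions and labels could then be passed to the returned Axes with set_xticks and set_xticklabels (and likewise for the y axis) to show the category names instead of the codes.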

With that idea, you can replace the random offsets (np.random) in the code above with deterministic ones. For example, the points of each group can be spread evenly around a small circle:

# grouping
groups = df.groupby(['time', 'sex'])

# compute the number of samples per group
num_samples = groups.tip.transform('size')

# enumerate the samples within each (time, sex) group and turn the rank into an angle
sample_ranks = groups.cumcount() * (2*np.pi) / num_samples

# compute the offset (a lone point stays at the centre)
x_offsets = np.where(num_samples.eq(1), 0, np.cos(sample_ranks) * 0.03)
y_offsets = np.where(num_samples.eq(1), 0, np.sin(sample_ranks) * 0.03)

# plot
axes = sns.scatterplot(x=df.time.cat.codes + x_offsets, 
                       y=df.sex.cat.codes + y_offsets,
                       size=df.tip)

Output:
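The circular-offset idea can also be wrapped in a small function that runs without any plotting library. This is only a sketch: the helper name circular_offsets and its radius default are assumptions made for illustration, not part of pandas or seaborn, and the DataFrame reuses just the time and sex columns of the example data.

```python
import numpy as np
import pandas as pd

def circular_offsets(df, keys, radius=0.03):
    """Return (x, y) offsets that spread each group's rows evenly on a circle.

    Hypothetical helper written for this example.
    """
    groups = df.groupby(keys, observed=True)
    n = groups[keys[0]].transform("size")  # rows per group, aligned with df
    rank = groups.cumcount()               # 0, 1, 2, ... within each group
    angle = 2 * np.pi * rank / n
    # A group with a single row keeps its point at the centre of the cell.
    x = np.where(n.eq(1), 0.0, radius * np.cos(angle))
    y = np.where(n.eq(1), 0.0, radius * np.sin(angle))
    return x, y

df = pd.DataFrame({
    "time": ["Dinner"] * 5 + ["Lunch"] * 5,
    "sex": ["Male", "Male", "Male", "Male", "Female",
            "Female", "Female", "Male", "Male", "Male"],
})
x_off, y_off = circular_offsets(df, ["time", "sex"])
```

The returned arrays can be added to the cat.codes of the two columns before plotting. Unlike a fixed lookup table of offsets, this works for any number of points sharing a cell, although a very crowded cell may need a larger radius.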
