How to debug a scatter plot in Matplotlib?


Problem description

I have the following df:

import pandas as pd

df = pd.DataFrame([
    ['A', 'X', '2020-10-01', 1],
    ['A', 'X', '2020-10-02', 2],
    ['A', 'X', '2020-10-03', 3],
    ['A', 'Y', '2020-10-01', 4],
    ['A', 'Y', '2020-10-02', 5],
    ['A', 'Y', '2020-10-03', 6],
    ['B', 'Z', '2020-10-01', 7],
    ['B', 'Z', '2020-10-02', 8],
    ['B', 'Z', '2020-10-03', 9],
    ['B', 'Z', '2020-10-01', 10],
    ['B', 'Z', '2020-10-02', 11],
    ['B', 'Z', '2020-10-03', 12],
],
    columns=['Q', 'W', 'DT', 'V']
)

I would like to create a scatter plot:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8), frameon=False)
fig.suptitle('Plotz', fontsize=16)
ax.set_title('DF Plot')
ax.scatter(x=df.DT, y=df.W, s=df.V)

This created the following chart:

I would like to figure out what actually happens, since the graph shows 9 data points while the data contains 12. Annotating the chart does not help: each position in the top row simply gets two labels.

for i, txt in enumerate(df.V):
    ax.annotate(txt, (df.DT[i], df.W[i]), fontsize=14)

Is there a way to figure out what really happens under the hood when there are multiple values for the same x,y pair (as in this case)?

Update: Maybe I was not clear. What is the default behaviour of Matplotlib in this scenario? Does the last value win? How could I display the actual value on the plot? (That is, show the value that is really drawn, unlike the annotate code, which shows both values.)

After more searching, I think this is the answer:

Visualization of scatter plots with overlapping points in matplotlib

Solution

What normally happens is that the dots are plotted in the order they are encountered, one over the other. If there is no transparency, only the last one plotted is fully visible; the earlier ones show at most a border, and only if they were larger.
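To see which rows collide before plotting anything, one option is to group the values by their (DT, W) position in row order. This is a minimal sketch using only the standard library; the names `rows`, `by_position` and `overlaps` are made up for illustration, and it relies on the draw order described above, so the last value in each list is taken to be the dot drawn on top.

```python
from collections import defaultdict

# (W, DT, V) triples copied from the question's DataFrame, in row order.
rows = [
    ('X', '2020-10-01', 1), ('X', '2020-10-02', 2), ('X', '2020-10-03', 3),
    ('Y', '2020-10-01', 4), ('Y', '2020-10-02', 5), ('Y', '2020-10-03', 6),
    ('Z', '2020-10-01', 7), ('Z', '2020-10-02', 8), ('Z', '2020-10-03', 9),
    ('Z', '2020-10-01', 10), ('Z', '2020-10-02', 11), ('Z', '2020-10-03', 12),
]

# Collect the V values that land on each (x, y) position, preserving row order.
by_position = defaultdict(list)
for w, dt, v in rows:
    by_position[(dt, w)].append(v)

# Keep only the positions that receive more than one dot.
overlaps = {pos: vals for pos, vals in by_position.items() if len(vals) > 1}

print(len(rows), "rows,", len(by_position), "distinct positions")
for pos, vals in overlaps.items():
    print(pos, "values:", vals, "drawn last:", vals[-1])
```

For the sample data this reports 12 rows at 9 distinct positions, with each of the three Z positions holding two values.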

Therefore, one approach to debugging this kind of situation is to set an alpha value that makes the dots transparent. Dots stacked on top of each other will then appear darker and show a visible border.

With the given test data, the code below enlarges the dots and sets an alpha value. Because the dots become very large, the axis limits need to be adjusted. Using multiple colors would emphasize the overlaps even more.

ax.scatter(x=df.DT, y=df.W, s=df.V*150, alpha=0.4)
plt.xlim(-1,3)
plt.ylim(-1,3)

Another approach is to add jitter, that is, some small random noise added to each dot position. For numerical data, the jitter can be added directly to the data. For categorical data, the positions can be modified after calling scatter:

import numpy as np
dots = ax.scatter(x=df.DT, y=df.W, s=df.V)
offsets = dots.get_offsets()
jittered_offsets = offsets + np.random.uniform(-0.1, 0.1, offsets.shape)
dots.set_offsets(jittered_offsets)

With the original colors and sizes, and without alpha, this clearly draws attention to the dots that overlapped.
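For the numerical case mentioned above, the noise can be added before plotting instead of after. A minimal sketch, assuming hypothetical numeric arrays `x` and `y` (they are not part of the question's data) and an arbitrary jitter width of 0.1:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the jitter is reproducible

# Hypothetical numeric data in which every position occurs twice.
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
y = np.array([5.0, 5.0, 5.0, 5.0, 5.0, 5.0])

# Add small uniform noise to each coordinate. The width 0.1 is an arbitrary
# choice; it should be small relative to the spacing between real data values.
x_jittered = x + rng.uniform(-0.1, 0.1, size=x.shape)
y_jittered = y + rng.uniform(-0.1, 0.1, size=y.shape)

# The jittered arrays can then be passed to ax.scatter(x_jittered, y_jittered).
```

Jittering a copy rather than the original arrays keeps the real values available for labels or further analysis.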

Still another approach, if both axes are categorical, is to count the dots at each position and encircle the positions that appear more than once:

import collections
dots = ax.scatter(x=df.DT, y=df.W, s=df.V)
offsets = dots.get_offsets()
counts = collections.Counter([(x, y) for x, y in offsets])
suspects = [p for p in counts if counts[p] >= 2]
ax.scatter([x for x, _ in suspects], [y for _, y in suspects],
           ec='crimson', lw=1, fc='none', s=50)

Of course, the different approaches (alpha, colors, jittering, encircling) can be combined, depending on the specifics of the actual data.
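As one possible combination that also addresses the question's wish to see the actual values, the values can be grouped per position and written as a single label, so that labels no longer overprint each other. This is only a sketch: the plain lists `w`, `dt` and `v` stand in for the DataFrame columns, the `labels` dictionary is a made-up helper, and the `Agg` backend is an assumption so the code runs without a display.

```python
import collections

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Same W, DT and V columns as in the question, written as plain lists.
dates = ['2020-10-01', '2020-10-02', '2020-10-03']
w = ['X'] * 3 + ['Y'] * 3 + ['Z'] * 6
dt = dates * 4
v = list(range(1, 13))

fig, ax = plt.subplots(figsize=(12, 8))
# Enlarged, semi-transparent dots so stacked dots show up darker.
ax.scatter(dt, w, s=[150 * val for val in v], alpha=0.4)
ax.set_xlim(-1, 3)
ax.set_ylim(-1, 3)

# Group the values by position, keeping row order within each position.
labels = collections.defaultdict(list)
for x, y, val in zip(dt, w, v):
    labels[(x, y)].append(str(val))

# Write one combined label per position instead of one label per row.
for (x, y), vals in labels.items():
    ax.annotate(', '.join(vals), (x, y), ha='center', va='center', fontsize=14)
```

Each position then carries exactly one label, and a label listing several values marks a spot where dots are stacked.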
