Scatterplot Contours in Matplotlib
Question
I'm generating a massive scatterplot (~100,000 points) in matplotlib. Each point has a location in this x/y space, and I'd like to generate contours containing certain percentiles of the total number of points.
Is there a function in matplotlib that will do this? I've looked into contour(), but I'd have to write my own function to make it work this way.
Thanks!
Recommended answer
Basically, you want a density estimate of some sort. There are multiple ways to do this:
- Use a 2D histogram of some sort (e.g. matplotlib.pyplot.hist2d or matplotlib.pyplot.hexbin). You could also display the results as contours: just use numpy.histogram2d and then contour the resulting array.
- Make a kernel density estimate (KDE) and contour the results. A KDE is essentially a smoothed histogram. Instead of a point falling into a particular bin, it adds a weight to the surrounding bins (usually in the shape of a Gaussian "bell curve").
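The first option's suggestion of contouring a numpy.histogram2d result can be sketched roughly as follows. This is only an illustration: the sample data, the bin count of 50, and the number of contour levels are arbitrary assumptions, not values from the question.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data only: 100,000 correlated points (assumed for this sketch)
rng = np.random.default_rng(1977)
x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 100_000).T

# Bin the points; counts is indexed as counts[ix, iy]
counts, xedges, yedges = np.histogram2d(x, y, bins=50)

# contour() wants coordinates at the data locations, so use bin centers
xcenters = 0.5 * (xedges[:-1] + xedges[1:])
ycenters = 0.5 * (yedges[:-1] + yedges[1:])

fig, ax = plt.subplots()
# contour() expects Z with shape (len(y), len(x)), hence the transpose
ax.contour(xcenters, ycenters, counts.T, levels=5)
ax.set_title('Contours of a 2D histogram')
```

Call plt.show() or fig.savefig(...) afterwards to see the figure. The transpose matters: numpy.histogram2d puts x along the first axis, while contour() expects y along the first axis.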
Using a 2D histogram is simple and easy to understand, but fundamentally gives "blocky" results.
There are some wrinkles to doing the second one "correctly" (i.e. there's no one correct way). I won't go into the details here, but if you want to interpret the results statistically, you need to read up on it (particularly bandwidth selection).
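To give a feel for why bandwidth selection matters, here is a small sketch using the bw_method argument of scipy.stats.gaussian_kde. The sample data and the scalar bandwidth factors (0.1 and 1.0) are arbitrary assumptions chosen to exaggerate the effect; they are not recommendations.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative data only: 200 correlated points (assumed for this sketch)
rng = np.random.default_rng(1977)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)

# gaussian_kde expects an array of shape (n_dims, n_points)
k_default = gaussian_kde(data.T)                 # Scott's rule by default
k_narrow = gaussian_kde(data.T, bw_method=0.1)   # scalar is used as the factor
k_wide = gaussian_kde(data.T, bw_method=1.0)

# Compare the bandwidth factor and the estimated density at the origin
origin = np.array([[0.0], [0.0]])
for name, k in [('default', k_default), ('narrow', k_narrow), ('wide', k_wide)]:
    print(name, k.factor, k(origin)[0])
```

A small factor follows individual points closely and gives a noisy estimate, while a large one smooths features away. Which value is appropriate depends on the data and on what you intend to conclude from the result.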
At any rate, here's an example of the differences. I'm going to plot each one similarly, so I won't use contours, but you could just as easily plot the 2D histogram or Gaussian KDE using a contour plot:
import numpy as np
import matplotlib.pyplot as plt
# The scipy.stats.kde module is deprecated; import gaussian_kde directly
from scipy.stats import gaussian_kde

np.random.seed(1977)

# Generate 200 correlated x,y points
data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)
x, y = data.T

nbins = 20

fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, sharey=True)

axes[0, 0].set_title('Scatterplot')
axes[0, 0].plot(x, y, 'ko')

axes[0, 1].set_title('Hexbin plot')
axes[0, 1].hexbin(x, y, gridsize=nbins)

axes[1, 0].set_title('2D Histogram')
axes[1, 0].hist2d(x, y, bins=nbins)

# Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents
k = gaussian_kde(data.T)
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

axes[1, 1].set_title('Gaussian KDE')
axes[1, 1].pcolormesh(xi, yi, zi.reshape(xi.shape))

fig.tight_layout()
plt.show()
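Coming back to the original question, one way to get contours that enclose a given fraction of the points is to find the density level above which that fraction of the mass lies, then contour at that level. The sketch below does this on a 2D histogram. The helper name levels_enclosing, the sample data, the bin count, and the 50% and 90% fractions are all assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def levels_enclosing(density, fractions):
    """Density thresholds whose super-level sets hold the given mass fractions.

    Hypothetical helper: the regions found this way are highest-density
    regions on the grid, not exact quantile curves.
    """
    flat = np.sort(density.ravel())[::-1]        # highest density first
    cum = np.cumsum(flat) / flat.sum()           # cumulative mass fraction
    idx = np.searchsorted(cum, fractions)        # first cell reaching each fraction
    idx = np.minimum(idx, flat.size - 1)         # guard against rounding at 1.0
    return flat[idx]

# Illustrative data only: 100,000 correlated points
rng = np.random.default_rng(1977)
x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 100_000).T

counts, xedges, yedges = np.histogram2d(x, y, bins=50)
xcenters = 0.5 * (xedges[:-1] + xedges[1:])
ycenters = 0.5 * (yedges[:-1] + yedges[1:])

fractions = [0.5, 0.9]
levels = levels_enclosing(counts, fractions)

fig, ax = plt.subplots()
# contour() needs strictly increasing levels; np.unique sorts and dedupes
ax.contour(xcenters, ycenters, counts.T, levels=np.unique(levels))
ax.set_title('Contours enclosing about 50% and 90% of the points')
```

Because the histogram is noisy, the outer contour tends to look ragged. Applying the same helper to a smoothed density (for example a KDE evaluated on a grid) should give cleaner curves, at the cost of the bandwidth choices mentioned above.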
One caveat: with very large numbers of points, scipy.stats.gaussian_kde will become very slow. It's fairly easy to speed it up by making an approximation: just take the 2D histogram and blur it with a Gaussian filter of the right radius and covariance. I can give an example if you'd like.
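A rough sketch of that approximation, using scipy.ndimage.gaussian_filter, might look like the following. It is a simplification of what is described above: it uses a separate bandwidth per axis derived from Scott's rule and ignores the off-diagonal covariance, so correlated data are smoothed slightly differently than gaussian_kde would smooth them. The function name fast_kde, the bin count, and the sample data are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fast_kde(x, y, bins=200):
    """Approximate a 2D Gaussian KDE by blurring a 2D histogram.

    Hypothetical helper: axis-aligned bandwidths only (no covariance term).
    """
    n = x.size
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    dx = xedges[1] - xedges[0]
    dy = yedges[1] - yedges[0]

    # Scott's factor for 2D data, as used by gaussian_kde by default
    factor = n ** (-1.0 / 6.0)
    # Kernel standard deviation per axis, converted to units of bins
    sigma = (factor * x.std(ddof=1) / dx, factor * y.std(ddof=1) / dy)

    smoothed = gaussian_filter(counts, sigma=sigma, mode='constant')
    # Normalize so the result integrates to roughly 1 over the grid
    density = smoothed / (n * dx * dy)
    return density, xedges, yedges

# Illustrative data only: 100,000 correlated points
rng = np.random.default_rng(1977)
x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 100_000).T

density, xedges, yedges = fast_kde(x, y)
cell_area = (xedges[1] - xedges[0]) * (yedges[1] - yedges[0])
print('integral over grid:', density.sum() * cell_area)
```

The cost here is dominated by the histogram and a separable filter, so it scales with the grid size rather than with the number of points times the number of grid cells. The returned array is indexed [ix, iy], so transpose it before passing it to contour().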
One other caveat: if you're doing this in a non-Cartesian coordinate system, none of these methods apply! Getting density estimates on a spherical shell is a bit more complicated.