Creating a heatmap by sampling and bucketing from a 3D array


Problem description

I have some experimental data that exists like so:

import numpy as np

x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1, ...])
y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75, ...])
z = np.array([10, 4, 1, 4, 5, 0, 1, ...])

If it's convenient, we can assume that the data exists as a 3D array or even a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'x': x, 'y': y, 'z': z})

The interpretation is that for every position x[i], y[i], the value of some variable is z[i]. The positions are not evenly sampled, so some parts are densely sampled (e.g. between 1 and 1.2 in x) and others are very sparse (e.g. between 2 and 3 in x). Because of this, I can't just chuck the data into pcolormesh or contourf.

What I would like to do instead is to resample x and y evenly at some fixed interval and then aggregate the values of z. For my needs, z can be summed or averaged to get meaningful values, so this is not a problem. My naïve attempt was like this:

# Regular grid with 0.1 spacing in x and y
X = np.arange(min(x), max(x), 0.1)
Y = np.arange(min(y), max(y), 0.1)
x_g, y_g = np.meshgrid(X, Y)
nx, ny = x_g.shape
z_g = np.full(x_g.shape, np.nan)

# For every grid cell, collect the samples that fall inside it and sum their z
for ix in range(nx - 1):
    for jx in range(ny - 1):
        x_min = x_g[ix, jx]
        x_max = x_g[ix + 1, jx + 1]
        y_min = y_g[ix, jx]
        y_max = y_g[ix + 1, jx + 1]
        vals = df[(df.x >= x_min) & (df.x < x_max) &
                  (df.y >= y_min) & (df.y < y_max)].z.values
        if vals.any():
            z_g[ix, jx] = sum(vals)

This works and gives the output I want with plt.contourf(x_g, y_g, z_g), but it is SLOW! I have ~20k samples, which I then bin into ~800 steps in x and ~500 in y, meaning the double for loop runs ~400k iterations.

Is there any way to vectorize/optimize this? Even better if there is some function that already does this!

(Also tagging this as MATLAB, because the syntax of numpy and MATLAB is very similar and I have access to both.)
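
As an aside, NumPy's np.histogram2d can do this binned summation in a single call when given per-sample weights. The following is a minimal sketch, assuming the same 0.1 grid spacing as above; the edge handling and empty-cell masking are illustrative choices, not taken from the original post:

import numpy as np

# Bin edges with the same 0.1 spacing used in the question
x_edges = np.arange(x.min(), x.max() + 0.1, 0.1)
y_edges = np.arange(y.min(), y.max() + 0.1, 0.1)

# H[i, j] is the sum of z over samples in x-bin i and y-bin j;
# a second call without weights gives the per-bin sample counts
H, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=z)
counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])

# Transpose so rows run along y (as contourf expects) and blank empty bins
z_binned = np.where(counts.T > 0, H.T, np.nan)

If SciPy is available, scipy.stats.binned_statistic_2d offers the same binning with statistic='sum', 'mean' or 'count'.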

Answer

Here's a vectorized Python solution employing NumPy broadcasting and matrix multiplication with np.dot for the sum-reduction part -

# Boolean bin-membership masks: x_mask has shape (len(X)-1, n_samples),
# y_mask has shape (len(Y)-1, n_samples)
x_mask = ((x >= X[:-1, None]) & (x < X[1:, None]))
y_mask = ((y >= Y[:-1, None]) & (y < Y[1:, None]))

# Sum z over each (y-bin, x-bin) cell via matrix multiplication
z_g_out = np.dot(y_mask * z[None].astype(np.float32), x_mask.T)

# If needed, fill invalid (empty) cells with NaNs
z_g_out[y_mask.dot(x_mask.T.astype(np.float32)) == 0] = np.nan

Note that we avoid meshgrid here: the meshes it would create are huge, so skipping them saves memory and hopefully improves performance in the process.

# Original app (on plain arrays rather than a DataFrame)
def org_app(x, y, z):
    X = np.arange(min(x), max(x), 0.1)
    Y = np.arange(min(y), max(y), 0.1)
    x_g, y_g = np.meshgrid(X, Y)
    nx, ny = x_g.shape
    # One cell fewer along each axis so the output shape matches app1's
    z_g = np.full(np.asarray(x_g.shape) - 1, np.nan)

    for ix in range(nx - 1):
        for jx in range(ny - 1):
            x_min = x_g[ix, jx]
            x_max = x_g[ix + 1, jx + 1]
            y_min = y_g[ix, jx]
            y_max = y_g[ix + 1, jx + 1]
            vals = z[(x >= x_min) & (x < x_max) &
                     (y >= y_min) & (y < y_max)]
            if vals.any():
                z_g[ix, jx] = sum(vals)
    return z_g

# Proposed app
def app1(x, y, z):
    X = np.arange(min(x), max(x), 0.1)
    Y = np.arange(min(y), max(y), 0.1)
    x_mask = ((x >= X[:-1, None]) & (x < X[1:, None]))
    y_mask = ((y >= Y[:-1, None]) & (y < Y[1:, None]))

    z_g_out = np.dot(y_mask * z[None].astype(np.float32), x_mask.T)

    # If needed, fill invalid (empty) cells with NaNs
    z_g_out[y_mask.dot(x_mask.T.astype(np.float32)) == 0] = np.nan
    return z_g_out
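
If the per-cell average of z is wanted instead of the sum, the same membership masks can be reused by dividing the summed values by the per-cell sample counts. A minimal sketch along those lines, reusing the x_mask, y_mask and z arrays from the snippet above (the divide-by-count handling is an illustrative choice):

# Per-cell sample counts via the same matrix multiplication
counts = y_mask.astype(np.float32).dot(x_mask.T.astype(np.float32))

# Summed z per cell, then divide by counts; empty cells become NaN
z_sum = np.dot(y_mask * z[None].astype(np.float32), x_mask.T)
with np.errstate(invalid='ignore', divide='ignore'):
    z_mean = z_sum / counts
z_mean[counts == 0] = np.nan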

As seen above, for a fair benchmark the original approach is also run on plain array values, since fetching values from a DataFrame could slow things down.

Timings and verification -

In [143]: x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1])
     ...: y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75])
     ...: z = np.array([10, 4, 1, 4, 5, 0, 1])
     ...: 

# Verify outputs
In [150]: np.nansum(np.abs(org_app(x,y,z) - app1(x,y,z)))
Out[150]: 0.0

In [145]: %timeit org_app(x,y,z)
10 loops, best of 3: 19.9 ms per loop

In [146]: %timeit app1(x,y,z)
10000 loops, best of 3: 39.1 µs per loop

In [147]: 19900/39.1  # Speedup figure
Out[147]: 508.95140664961633
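
To plot the vectorized result the same way as in the question, pass the lower bin edges to contourf, since z_g_out has one cell fewer than X and Y along each axis. A usage sketch, assuming the X, Y and z_g_out names from the snippets above:

import matplotlib.pyplot as plt

# z_g_out has shape (len(Y) - 1, len(X) - 1), so plot against the lower bin edges
plt.contourf(X[:-1], Y[:-1], z_g_out)
plt.colorbar(label='sum of z per cell')
plt.show()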

