Allocate scatter plot into specific bins


Problem description

I have a scatter plot that gets sorted into 4 Bins. These are separated by two arcs and a line in the middle (see figure below).

There's a slight problem with the two arcs. If a point's x-coordinate is greater than that of ang2 (60), it doesn't get attributed to the correct Bin (see figure below).

import math
import matplotlib.pyplot as plt
import matplotlib as mpl

X = [24,15,71,72,6,13,77,52,52,62,46,43,31,35,41]  
Y = [94,61,76,83,69,86,78,57,45,94,82,74,56,70,94]      

fig, ax = plt.subplots()
ax.set_xlim(-100,100)
ax.set_ylim(-40,140)
ax.grid(False)

plt.scatter(X,Y)

#middle line
BIN_23_X = 0 
#two arcs
ang1 = -60, 60
ang2 = 60, 60
angle = math.degrees(math.acos(2/9.15))
E_xy = 0,60

Halfway = mpl.lines.Line2D((BIN_23_X,BIN_23_X), (0,125), color = 'white', lw = 1.5, alpha = 0.8, zorder = 1)
arc1 = mpl.patches.Arc(ang1, 70, 110, angle = 0, theta2 = angle, theta1 = 360-angle, color = 'white', lw = 2)
arc2 = mpl.patches.Arc(ang2, 70, 110, angle = 0, theta2 = 180+angle, theta1 = 180-angle, color = 'white', lw = 2)
Oval = mpl.patches.Ellipse(E_xy, 160, 130, lw = 3, edgecolor = 'black', color = 'white', alpha = 0.2)

ax.add_line(Halfway)
ax.add_patch(arc1)
ax.add_patch(arc2)
ax.add_patch(Oval)

#Sorting the coordinates into bins   
def get_nearest_arc_vert(x, y, arc_vertices):
    # squared distance from (x, y) to every arc vertex
    err = (arc_vertices[:,0] - x)**2 + (arc_vertices[:,1] - y)**2
    nearest = (arc_vertices[err == min(err)])[0]
    return nearest

arc1v = ax.transData.inverted().transform(arc1.get_verts())
arc2v = ax.transData.inverted().transform(arc2.get_verts())

def classify_pointset(vx, vy):
    bins = {(k+1):[] for k in range(4)}
    for (x,y) in zip(vx, vy):
        nx1, ny1 = get_nearest_arc_vert(x, y, arc1v)
        nx2, ny2 = get_nearest_arc_vert(x, y, arc2v)

        if x < nx1:                         
            bins[1].append((x,y))
        elif x > nx2:                      
            bins[4].append((x,y))
        else:
            if x < BIN_23_X:               
                bins[2].append((x,y))
            else:                          
                bins[3].append((x,y))
    return bins

#Bins Output
bins_red  = classify_pointset(X,Y)

all_points = [None] * 5
for bin_key in [1,2,3,4]:
    all_points[bin_key] = bins_red[bin_key] 

Output:

[[], [], [(24, 94), (15, 61), (71, 76), (72, 83), (6, 69), (13, 86), (77, 78), (62, 94)], [(52, 57), (52, 45), (46, 82), (43, 74), (31, 56), (35, 70), (41, 94)]]

This isn't quite right. Looking at the figure output below, 4 coordinates are in Bin 3 and 11 are in Bin 4. But 8 are attributed to Bin 3 and 7 are attributed to Bin 4.

I think the problem is the blue coordinates. Specifically, when the X-Coordinate is greater than ang2, which is 60. If I alter these to be less than 60 they will be corrected into Bin 3.

I'm not sure if I should extend the arcs to be greater than 60 or if the code can be improved?

Please note this is just for Bin 4 and ang2. The same issue will occur for Bin 1 and ang1. That is, if the x-coordinate is less than -60, the point won't get attributed to Bin 1.

Intended Output:

[[], [], [(24, 94), (15, 61), (6, 69), (13, 86)], [(71, 76), (72, 83), (52, 57), (52, 45), (46, 82), (43, 74), (31, 56), (35, 70), (41, 94), (77, 78), (62, 94)]]

Note: The intended output is preferred. The example uses one row of input data. However, my dataset is much larger. If we use numerous rows the output should be row by row. e.g

#Numerous rows
X = np.random.randint(50, size=(100, 10))
Y = np.random.randint(80, size=(100, 10)) 

Out:

Row 0 = [(x,y)],[(x,y)],[(x,y)],[(x,y)]
Row 1 = [(x,y)],[(x,y)],[(x,y)],[(x,y)]
Row 2 = [(x,y)],[(x,y)],[(x,y)],[(x,y)]
etc

Solution

Patches provide a test for whether they contain a point, contains_point, and one for arrays of points, contains_points.

Just to play with, here is a code snippet that you can add between the part where you add your patches and the #Sorting the coordinates into bins code block.

It adds two additional (transparent) ellipses to test whether the arcs would contain a point if they were fully closed ellipses. The bin calculation is then just a boolean combination of tests: whether a point belongs to the big oval, whether it belongs to the left or right ellipse, and whether its x-coordinate is positive or negative.

ov1 = mpl.patches.Ellipse(ang1, 70, 110, alpha=0)
ov2 = mpl.patches.Ellipse(ang2, 70, 110, alpha=0)
ax.add_patch(ov1)
ax.add_patch(ov2)

for px, py in zip(X, Y):
    in_oval = Oval.contains_point(ax.transData.transform(([px, py])), 0)
    in_left = ov1.contains_point(ax.transData.transform(([px, py])), 0)
    in_right = ov2.contains_point(ax.transData.transform(([px, py])), 0)
    on_left = px < 0
    on_right = px > 0
    if in_oval:
        if in_left:
            n_bin = 1
        elif in_right:
            n_bin = 4
        elif on_left:
            n_bin = 2
        elif on_right:
            n_bin = 3
        else:
            n_bin = -1
    else:
        n_bin = -1
    print('({:>2}/{:>2}) is {}'.format(px, py, 'in Bin ' +str(n_bin) if n_bin>0 else 'outside'))

The output is:

(24/94) is in Bin 3
(15/61) is in Bin 3
(71/76) is in Bin 4
(72/83) is in Bin 4
( 6/69) is in Bin 3
(13/86) is in Bin 3
(77/78) is outside
(52/57) is in Bin 4
(52/45) is in Bin 4
(62/94) is in Bin 4
(46/82) is in Bin 4
(43/74) is in Bin 4
(31/56) is in Bin 4
(35/70) is in Bin 4
(41/94) is in Bin 4

Note that you still need to decide how to bin points with x-coord = 0. At the moment they are treated as outside, as on_left and on_right both do not feel responsible for them...
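One way to make the x = 0 decision explicit is to write the containment tests directly from the ellipse equation, which also avoids the display-coordinate transform. The following is only a minimal sketch, assuming axis-aligned ellipses with the centres and sizes used in the question; in_ellipse and classify are hypothetical helper names, and sending x = 0 to Bin 3 (and counting boundary points as inside) is an arbitrary choice.

```python
def in_ellipse(px, py, center, width, height):
    """True if (px, py) lies in an axis-aligned ellipse given in data coordinates.

    width and height are full diameters, as in matplotlib's Ellipse arguments.
    """
    cx, cy = center
    return ((px - cx) / (width / 2)) ** 2 + ((py - cy) / (height / 2)) ** 2 <= 1.0


def classify(px, py):
    """Return the bin number 1..4, or -1 for points outside the big oval."""
    if not in_ellipse(px, py, (0, 60), 160, 130):   # big oval
        return -1
    if in_ellipse(px, py, (-60, 60), 70, 110):      # closed version of arc1
        return 1
    if in_ellipse(px, py, (60, 60), 70, 110):       # closed version of arc2
        return 4
    # Assumption: points exactly on the middle line (x == 0) go to Bin 3.
    return 2 if px < 0 else 3


X = [24, 15, 71, 72, 6, 13, 77, 52, 52, 62, 46, 43, 31, 35, 41]
Y = [94, 61, 76, 83, 69, 86, 78, 57, 45, 94, 82, 74, 56, 70, 94]

labels = [classify(x, y) for x, y in zip(X, Y)]
print(labels)
```

Swapping the last line of classify for `2 if px <= 0 else 3` would send the middle line to Bin 2 instead.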

PS: Thanks to @ImportanceOfBeingErnest for the hint to the necessary transformation: https://stackoverflow.com/a/49112347/8300135
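To illustrate why the transformation matters: for a patch that has been added to an axes, contains_point interprets the point in display (pixel) coordinates, so data coordinates have to go through ax.transData.transform first. A small sketch under that assumption, using the Agg backend so it runs without a window; the variable names are only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no window needed
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

fig, ax = plt.subplots()
ax.set_xlim(-100, 100)
ax.set_ylim(-40, 140)

oval = Ellipse((0, 60), 160, 130, alpha=0.2)
ax.add_patch(oval)

point = [24, 94]  # data coordinates, well inside the oval

# Without the transform the numbers are read as pixel positions.
inside_raw = oval.contains_point(point)
# With the transform the point is first converted to display coordinates.
inside_display = oval.contains_point(ax.transData.transform(point))

print("raw:", inside_raw, "transformed:", inside_display)
```

Because the display coordinates depend on the axes limits and figure size, the limits should be set before the tests are run.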

Note: for all the following EDITS you'll need to import numpy as np
EDIT: Function for counting the bin distribution per X, Y array input:

def bin_counts(X, Y):
    bc = dict()
    E = Oval.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_l = ov1.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_r = ov2.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    L = np.array(X) < 0
    R = np.array(X) > 0
    bc[1] = np.sum(E & E_l)
    bc[2] = np.sum(E & L & ~E_l)
    bc[3] = np.sum(E & R & ~E_r)
    bc[4] = np.sum(E & E_r)
    return bc

Will lead to this result:

bin_counts(X, Y)
Out: {1: 0, 2: 0, 3: 4, 4: 10}

EDIT2: many rows in two 2D-arrays for X and Y:

np.random.seed(42)
X = np.random.randint(-80, 80, size=(100, 10))
Y = np.random.randint(0, 120, size=(100, 10))

looping over all the rows:

for xr, yr in zip(X, Y):
    print(bin_counts(xr, yr))

result:

{1: 1, 2: 2, 3: 6, 4: 0}
{1: 1, 2: 0, 3: 4, 4: 2}
{1: 5, 2: 2, 3: 1, 4: 1}
...
{1: 3, 2: 2, 3: 2, 4: 0}
{1: 2, 2: 4, 3: 1, 4: 1}
{1: 1, 2: 1, 3: 6, 4: 2}

EDIT3: for returning not the number of points in each bin, but an array with four arrays containing the x,y-coordinates of the points in each bin, use the following:

X = [24,15,71,72,6,13,77,52,52,62,46,43,31,35,41]  
Y = [94,61,76,83,69,86,78,57,45,94,82,74,56,70,94]      

def bin_points(X, Y):
    X = np.array(X)
    Y = np.array(Y)
    E = Oval.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_l = ov1.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    E_r = ov2.contains_points(ax.transData.transform(np.array([X, Y]).T), 0)
    L = X < 0
    R = X > 0
    bp1 = np.array([X[E & E_l], Y[E & E_l]]).T
    bp2 = np.array([X[E & L & ~E_l], Y[E & L & ~E_l]]).T
    bp3 = np.array([X[E & R & ~E_r], Y[E & R & ~E_r]]).T
    bp4 = np.array([X[E & E_r], Y[E & E_r]]).T
    return [bp1, bp2, bp3, bp4]

print(bin_points(X, Y))
[array([], shape=(0, 2), dtype=int32), array([], shape=(0, 2), dtype=int32), array([[24, 94],
       [15, 61],
       [ 6, 69],
       [13, 86]]), array([[71, 76],
       [72, 83],
       [52, 57],
       [52, 45],
       [62, 94],
       [46, 82],
       [43, 74],
       [31, 56],
       [35, 70],
       [41, 94]])]

...and again, for applying this to the big 2D-arrays, just iterate over them:

np.random.seed(42)
X = np.random.randint(-100, 100, size=(100, 10))
Y = np.random.randint(-40, 140, size=(100, 10))

bincol = ['r', 'g', 'b', 'y', 'k']

for xr, yr in zip(X, Y):
    for i, binned_points in enumerate(bin_points(xr, yr)):
        ax.scatter(*binned_points.T, c=bincol[i], marker='o' if i<4 else 'x')
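If the row-by-row list output from the question is wanted rather than a plot, the same boolean logic can be applied to the whole 2D array at once. This is a hedged sketch that swaps the patch tests for the plain ellipse equation (so it needs no axes or transform); ellipse_mask and bin_labels are hypothetical names, the geometry is copied from the question, and x = 0 is assigned to Bin 3 by assumption.

```python
import numpy as np


def ellipse_mask(x, y, center, width, height):
    """Boolean mask of points inside an axis-aligned ellipse (data coordinates)."""
    cx, cy = center
    return ((x - cx) / (width / 2)) ** 2 + ((y - cy) / (height / 2)) ** 2 <= 1.0


def bin_labels(X, Y):
    """Label every point with its bin 1..4, or -1 if outside the big oval."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    inside = ellipse_mask(X, Y, (0, 60), 160, 130)
    left = ellipse_mask(X, Y, (-60, 60), 70, 110)
    right = ellipse_mask(X, Y, (60, 60), 70, 110)
    labels = np.where(X < 0, 2, 3)   # middle bins; x == 0 goes to Bin 3
    labels[left] = 1
    labels[right] = 4
    labels[~inside] = -1             # the big oval overrides everything
    return labels


rng = np.random.default_rng(42)
X = rng.integers(-100, 100, size=(100, 10))
Y = rng.integers(-40, 140, size=(100, 10))
labels = bin_labels(X, Y)

rows = []
for i, (xr, yr, lr) in enumerate(zip(X, Y, labels)):
    # four lists of (x, y) tuples per row, one list per bin
    bins = [[(int(x), int(y)) for x, y, lab in zip(xr, yr, lr) if lab == b]
            for b in (1, 2, 3, 4)]
    rows.append(bins)
    if i < 3:
        print(f"Row {i} = " + ",".join(map(str, bins)))
```

Points outside the big oval are simply dropped from the row lists here; they could be collected in a fifth list if needed.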
