识别由左侧和右侧的不同增量链接的聚类 [英] Identify clusters linked by delta to the left and different delta to the right

查看:44
本文介绍了识别由左侧和右侧的不同增量链接的聚类的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑排序后的数组a:

a = np.array([0, 2, 3, 4, 5, 10, 11, 11, 14, 19, 20, 20])

如果我指定了左右增量,

delta_left, delta_right = 1, 1

这就是我期望分配集群的方式:

#   a = [ 0  .  2  3  4  5  .  .  .  . 10 11  .  . 14  .  .  .  . 19 20
#                                         11                         20
#
#                                     [10--|-12]                 [19--|-21]
#           [1--|--3]                 [10--|-12]                 [19--|-21]
#    [-1--|--1]   [3--|--5]         [9--|-11]                 [18--|-20]
#   +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
#              [2--|--4]                       [13--|-15]
#
#         │    ╰──┬───╯                 ╰┬─╯        │              ╰┬─╯
#         │   cluster 2                Cluster 3    │           Cluster 5
#     Cluster 1                                 Cluster 4

注意: 尽管间隔[-1, 1][1, 3]共享一条边,但是两个间隔都不包含相邻点,因此不构成其各自簇的结合. /p>

假设群集分配存储在名为clusters的数组中,我希望结果看起来像这样

print(clusters)
[1 2 2 2 2 3 3 3 4 5 5 5]


但是,假设我将左右增量更改为其他值:

delta_left, delta_right = 2, 1

这意味着对于x值,它应与间隔[x - 2, x + 1]

中的任何其他点组合

#   a = [ 0  .  2  3  4  5  .  .  .  . 10 11  .  . 14  .  .  .  . 19 20
#                                         11                         20
#
#                                   [9-----|-12]              [18-----|-21]
#        [0-----|--3]               [9-----|-12]              [18-----|-21]
# [-2-----|--1][2-----|--5]      [8-----|-11]              [17-----|-20]
#   +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
#           [1 ----|--4]                    [12-----|-15]
#
#         ╰─────┬─────╯                 ╰┬─╯        │              ╰┬─╯
#           cluster 1                Cluster 2      │           Cluster 4
#                                               Cluster 3

注意: :尽管间隔[9, 12][12, 15]共享一条边,但是两个间隔都不包含相邻点,因此不构成其各自簇的结合. /p>

假设群集分配存储在名为clusters的数组中,我希望结果看起来像这样:

print(clusters)
[1 1 1 1 1 2 2 2 3 4 4 4]

解决方案

我们将利用 np.searchsorted 和查找簇边缘的逻辑.

首先,让我们仔细看看np.searchsorted的作用:

将索引查找到排序数组a中,这样,如果将v中的相应元素插入到索引之前,则将保留a的顺序.

我要做的是使用a - delta_lefta执行np.searchsorted.让我们来看看delta_left = 1

# a =
# [ 0  2  3  4  5 10 11 11 14 19 20 20]
# 
# a - delta_left
# [-1  1  2  3  4  9 10 10 13 18 19 19]

  • -1将插入到位置0以保持秩序
  • 1将插入到位置1以便保持秩序
  • 2也将插入到位置1,表明2可能与1
  • 在同一群集中
  • 3将插入到位置2,表明3可能与2
  • 在同一群集中
  • 依此类推

我们注意到的是,只有在元素的当前位置插入较少delta的元素时,我们才会考虑从新的簇开始.

我们再次对右侧执行此操作,但有所不同.区别在于,默认情况下,如果一堆元素相同,则np.searchsorted假定插入值的前面.为了确定簇的末端,我将要插入相同的元素之后.因此,我将使用参数side='right'

如果为"left",则给出找到的第一个合适位置的索引.如果为正确",则返回最后一个这样的索引.如果没有合适的索引,则返回0或N(其中N是a的长度).

现在开始逻辑.只有第一个集群除外,如果先前的集群已结束,则集群才能开始.然后,我们将考虑第二个np.searchsorted

结果的偏移版本

现在让我们定义功能

def delta_cluster(a, dleft, dright):
    # use to track whether searchsorted results are at correct positions
    rng = np.arange(len(a))

    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng

    # we append 0 to shift
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng

    return (starts & ends).cumsum()


演示

左,右增量分别等于1和1

print(delta_cluster(a, 1, 1))

[1 2 2 2 2 3 3 3 4 5 5 5]

左,右增量分别等于2和1

print(delta_cluster(a, 2, 1))

[1 1 1 1 1 2 2 2 3 4 4 4]

额外信用
如果a不排序怎么办?
我将利用从这篇文章

中学到的信息

def delta_cluster(a, dleft, dright):

    s = a.argsort()

    size = s.size

    if size > 1000:
        y = np.empty(s.size, dtype=np.int64)
        y[s] = np.arange(s.size)
    else:
        y = s.argsort()

    a = a[s]

    rng = np.arange(len(a))

    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng

    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng

    return (starts & ends).cumsum()[y]

演示

b = np.random.permutation(a)
print(b)

[14 10  3 11 20  0 19 20  4 11  5  2]


print(delta_cluster(a, 2, 1))

[1 1 1 1 1 2 2 2 3 4 4 4]


print(delta_cluster(b, 2, 1))

[3 2 1 2 4 1 4 4 1 2 1 1]


print(delta_cluster(b, 2, 1)[b.argsort()])

[1 1 1 1 1 2 2 2 3 4 4 4]

Consider the sorted array a:

a = np.array([0, 2, 3, 4, 5, 10, 11, 11, 14, 19, 20, 20])

If I specified left and right deltas,

delta_left, delta_right = 1, 1

Then this is how I'd expect the clusters to be assigned:

#   a = [ 0  .  2  3  4  5  .  .  .  . 10 11  .  . 14  .  .  .  . 19 20
#                                         11                         20
#
#                                     [10--|-12]                 [19--|-21]
#           [1--|--3]                 [10--|-12]                 [19--|-21]
#    [-1--|--1]   [3--|--5]         [9--|-11]                 [18--|-20]
#   +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
#              [2--|--4]                       [13--|-15]
#
#         │    ╰──┬───╯                 ╰┬─╯        │              ╰┬─╯
#         │   cluster 2                Cluster 3    │           Cluster 5
#     Cluster 1                                 Cluster 4

NOTE: Despite the interval [-1, 1] sharing an edge with [1, 3], neither interval includes an adjacent point and therefore do not constitute joining their respective clusters.

Assuming the cluster assignments were stored in an array named clusters, I'd expect the results to look like this

print(clusters)
[1 2 2 2 2 3 3 3 4 5 5 5]


However, suppose I change the left and right deltas to be different:

delta_left, delta_right = 2, 1

This means that for a value of x it should be combined with any other point in the interval [x - 2, x + 1]

#   a = [ 0  .  2  3  4  5  .  .  .  . 10 11  .  . 14  .  .  .  . 19 20
#                                         11                         20
#
#                                   [9-----|-12]              [18-----|-21]
#        [0-----|--3]               [9-----|-12]              [18-----|-21]
# [-2-----|--1][2-----|--5]      [8-----|-11]              [17-----|-20]
#   +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
#           [1 ----|--4]                    [12-----|-15]
#
#         ╰─────┬─────╯                 ╰┬─╯        │              ╰┬─╯
#           cluster 1                Cluster 2      │           Cluster 4
#                                               Cluster 3

NOTE: Despite the interval [9, 12] sharing an edge with [12, 15], neither interval includes an adjacent point and therefore do not constitute joining their respective clusters.

Assuming the cluster assignments were stored in an array named clusters, I'd expect the results to look like this:

print(clusters)
[1 1 1 1 1 2 2 2 3 4 4 4]

解决方案

We will leverage np.searchsorted and logic to find cluster edges.

First, let's take a closer look at what np.searchsorted does:

Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.

What I'll do is execute np.searchsorted with a using a - delta_left. Let's look at that for delta_left = 1

# a =
# [ 0  2  3  4  5 10 11 11 14 19 20 20]
# 
# a - delta_left
# [-1  1  2  3  4  9 10 10 13 18 19 19]

  • -1 would get inserted at position 0 to maintain order
  • 1 would get inserted at position 1 to maintain order
  • 2 would get inserted at position 1 as well, indicating that 2 might be in the same cluster as 1
  • 3 would get inserted at position 2 indicating that 3 might be in the same cluster as 2
  • so on and so forth

What we notice is that only when an element less delta would get inserted at its current position would we consider a new cluster starting.

We do this again for the right side with a difference. The difference is that by default if a bunch of elements are the same, np.searchsorted assumes to insert into the front of values. To identify the ends of clusters, I'm going to want to insert after the identical elements. Therefore I'll use the paramater side='right'

If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of a).

Now the logic. A cluster can only begin if a prior cluster has ended, with the exception of the first cluster. We'll then consider a shifted version of the results of our second np.searchsorted


Let's now define our function

def delta_cluster(a, dleft, dright):
    # use to track whether searchsorted results are at correct positions
    rng = np.arange(len(a))

    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng

    # we append 0 to shift
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng

    return (starts & ends).cumsum()


demonstration

with left, right deltas equal to 1 and 1

print(delta_cluster(a, 1, 1))

[1 2 2 2 2 3 3 3 4 5 5 5]

with left, right deltas equal to 2 and 1

print(delta_cluster(a, 2, 1))

[1 1 1 1 1 2 2 2 3 4 4 4]

Extra Credit
What if a isn't sorted?
I'll utilize information learned from this post

def delta_cluster(a, dleft, dright):

    s = a.argsort()

    size = s.size

    if size > 1000:
        y = np.empty(s.size, dtype=np.int64)
        y[s] = np.arange(s.size)
    else:
        y = s.argsort()

    a = a[s]

    rng = np.arange(len(a))

    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng

    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng

    return (starts & ends).cumsum()[y]

demonstration

b = np.random.permutation(a)
print(b)

[14 10  3 11 20  0 19 20  4 11  5  2]


print(delta_cluster(a, 2, 1))

[1 1 1 1 1 2 2 2 3 4 4 4]


print(delta_cluster(b, 2, 1))

[3 2 1 2 4 1 4 4 1 2 1 1]


print(delta_cluster(b, 2, 1)[b.argsort()])

[1 1 1 1 1 2 2 2 3 4 4 4]

这篇关于识别由左侧和右侧的不同增量链接的聚类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆