Extracting boundaries of dense regions of 1s in a huge list of 1s and 0s


Problem description

I'm not sure how to word my problem, but here it is.

I have a huge list of 1s and 0s (total length = 53820).

Example of what the list looks like: [0,1,1,1,1,1,1,1,1,0,0,0,1,1,0,0,0,0,0,0,1,1...........]

Here is a visualization.

x-axis: index of the element (from 0 to 53820)

y-axis: value at that index (i.e. 1 or 0)

Input plot ->

The plot clearly shows 3 dense areas where 1s occur more often. I have drawn on top of the plot to mark the visually dense areas (the ugly black lines). I want to know the x-axis index numbers of the start and end boundaries of these dense areas.

I extract the chunks of 1s and save the start index of each chunk in a new list named 'starts'. The function that does this returns a list of dictionaries like this:

{'start': 0, 'count': 15, 'end': 16}, {'start': 2138, 'count': 3, 'end': 2142}, {'start': 2142, 'count': 3, 'end': 2146}, {'start': 2461, 'count': 1, 'end': 2463}, {'start': 2479, 'count': 45, 'end': 2525}, {'start': 2540, 'count': 2, 'end': 2543}

Then, after setting a threshold, I compare adjacent elements of starts, which returns the apparent boundaries of the dense areas.

THR = 2000
results = []
cues = {'start': 0, 'stop': 0}
result, starts = densest(preds)  # Function that returns the list of dictionaries shown above
cuestart = False  # Flag to check if looking for start or stop of dense boundary
for now, nextf in zip(starts, starts[1:]):  # Walk over adjacent pairs of start indexes
    if nextf - now > THR:
        if not cuestart:
            cues['start'] = nextf
            cues['stop'] = nextf
            cuestart = True
        else:  # cuestart is already set
            cues['stop'] = now
            cuestart = False
            results.append(cues)
            cues = {'start': 0, 'stop': 0}

print('\n', results)

The output and the corresponding plot look like this.

[{'start': 2138, 'stop': 6654}, {'start': 23785, 'stop': 31553}, {'start': 38765, 'stop': 38765}]

Output plot ->

This method fails to find the last dense region seen in the plot, and it fails in the same way on other data of a similar kind.

P.S. I have also tried 'KDE' and 'distplot' from seaborn on this data, but they give me plots directly and I am unable to extract the boundary values from them. The link for that question is here (Getting dense region boundary values from output of KDE plot)

Recommended answer

OK, you need an answer...

First, the imports (we are going to use LineCollection)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

Next, the definition of the constants

N = 1001
np.random.seed(20190515)

and the generation of the fake data

x = np.linspace(0, 1, N)
prob = np.where(x < 0.4, 0.02, np.where(x < 0.7, 0.95, 0.02))
y = np.where(np.random.rand(N) < prob, 1, 0)

Here we create the line collection; sticks is an N×2×2 array containing the start and end points of our vertical lines

sticks = np.array(list(zip(zip(x, np.zeros(N)), zip(x, y))))                                  
lc = LineCollection(sticks)                                                      

Finally, the cumulative sum, here normalized to have the same scale as the vertical lines

cs = (y - 0.5).cumsum()
csmin, csmax = cs.min(), cs.max()
cs = (cs - csmin) / (csmax - csmin)  # normalized to the range 0 to 1
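
Why subtract 0.5 before summing? Each 0 then pushes the cumulative sum down by 0.5 and each 1 pushes it up by 0.5, so the curve falls through sparse stretches and climbs through dense ones. A tiny sketch on a made-up array (the names y_demo and cs_demo are only for illustration, and there is a single clean run of 1s, which real data will not give you):

```python
import numpy as np

# Illustrative data only: one run of 1s at indexes 3..6.
y_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0])

# Each 0 contributes -0.5 and each 1 contributes +0.5.
cs_demo = (y_demo - 0.5).cumsum()

# The minimum sits on the last 0 before the run, the maximum on its last 1.
start = int(np.argmin(cs_demo)) + 1
stop = int(np.argmax(cs_demo))
print(start, stop)  # -> 3 6
```

With several dense regions and noisy data the global minimum and maximum are no longer enough, and one has to look at the local extremes instead.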

We only have to plot the results

f, a = plt.subplots()
a.add_collection(lc)
a.plot(x, cs, color='red')
a.grid()
a.autoscale()
plt.show()  # needed when running as a script rather than interactively

Here is the plot, and here is a detail of the stop zone.

You can smooth the cs data and use something from scipy.optimize to spot the positions of the extremes. Should you have a problem with this last step, please ask another question.
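
For what it is worth, here is a minimal sketch of that last step. It uses plain NumPy (a moving average plus a sign change of the slope) rather than scipy.optimize. The helper name dense_regions and the default window of 51 samples are my own assumptions, not something taken from the code above. The window is assumed to be odd and shorter than the data, and boundaries closer than half a window to either end of the list are not resolved properly.

```python
import numpy as np

def dense_regions(y, window=51):
    """Return (start, stop) index pairs of regions where 1s dominate.

    Hypothetical helper: `window` is an assumed smoothing width (odd,
    shorter than len(y)); tune it to the data at hand.
    """
    y = np.asarray(y, dtype=float)
    cs = (y - 0.5).cumsum()
    # Moving average of cs; 'valid' avoids zero-padding artifacts at the ends.
    kernel = np.ones(window) / window
    smooth = np.convolve(cs, kernel, mode='valid')
    # smooth rises wherever more than half of the samples in a window are 1s.
    rising = np.diff(smooth) > 0
    # Pad with False so that runs touching either end are still closed.
    padded = np.concatenate(([False], rising, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    half = window // 2
    # Even entries are local minima of smooth (the region starts right after),
    # odd entries are local maxima (the last sample of the region).
    return [(int(lo) + half + 1, int(hi) + half)
            for lo, hi in zip(edges[::2], edges[1::2])]

# Noise-free demonstration data with two blocks of 1s.
y_blocks = np.zeros(1000, dtype=int)
y_blocks[100:200] = 1
y_blocks[400:700] = 1
regions = dense_regions(y_blocks)
print(regions)  # -> [(100, 199), (400, 699)]
```

On noisy data the slope can flip sign a few times while the window crosses a boundary, which would show up as short spurious regions next to the real ones. A wider window, or discarding regions shorter than some minimum length, are possible remedies, but I have not tuned either for your data.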
