寻找具有位掩码数据差距 [英] Finding data gaps with bit masking

查看:102
本文介绍了寻找具有位掩码数据差距的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我面对发现在数字序列的不连续性给定长度的(间隙)的问题。因此,例如,由于 [1,2,3,7,8,9,10] 的间隙长度= 3 ,我会找到 [4,5,6] 。如果间隙长度= 4 ,我会找到什么。真正的序列是,当然,要长得多。我已经看到了这个问题,在不少帖子,并且它有各种应用程序和可能的实现。

I'm faced with a problem of finding discontinuities (gaps) of a given length in a sequence of numbers. So, for example, given [1,2,3,7,8,9,10] and a gap of length=3, I'll find [4,5,6]. If the gap is length=4, I'll find nothing. The real sequence is, of course, much longer. I've seen this problem in quite a few posts, and it had various applications and possible implementations.

我认为可能的工作,应该是比较快速的方法之一是重新present全套含1可用的号码和0失踪位阵列 - 所以上面看起来像 [1,1,1,0,0,0,1,1,1,1] 。然后可能运行将异或掩蔽给定长度的阵列与成套的窗口函数,直到所有的地点导致1.这将需要在大致〜为O(n)在整个序列的单通,加的成本掩蔽在每次运行。

One way I thought might work and should be relatively quick is to represent the complete set as a bit array containing 1 for available number and 0 for missing - so the above will look like [1,1,1,0,0,0,1,1,1,1]. Then possibly run a window function that'll XOR mask an array of the given length with the complete set until all locations result in 1. This will require a single pass over the whole sequence in roughly ~O(n), plus the cost of masking in each run.

这就是我设法想出了:

def find_gap(array, start=0, length=10):
    """
    array:  assumed to be of length MAX_NUMBER and contain 0 or 1 
            if the value is actually present
    start:  indicates what value to start looking from
    length: what the length the gap should be
    """

    # create the bitmask to check against
    mask = ''.join( [1] * length )

    # convert the input 0/1 mapping to bit string
    # e.g - [1,0,1,0] -> '1010'
    bits =''.join( [ str(val) for val in array ] )

    for i in xrange(start, len(bits) - length):

        # find where the next gap begins
        if bits[i] != '0': continue

        # gap was found, extract segment of size 'length', compare w/ mask
        if (i + length < len(bits)):
            segment = bits[i:i+length]

            # use XOR between binary masks
            result  = bin( int(mask, 2) ^ int(segment, 2) )

            # if mask == result in base 2, gap found
            if result == ("0b%s" % mask): return i

    # if we got here, no gap exists
    return -1

这是为100K〜相当快(小于1秒)。我倒是如何使这个更快/更较大的集高效AP preciate提示。谢谢!

This is fairly quick for ~100k (< 1 sec). I'd appreciate tips on how to make this faster / more efficient for larger sets. thanks!

推荐答案

您可以使用XOR和移位,它并运行大致O(n)时间。

You could use XOR and shift and it does run in roughly O(n) time.

然而,在实践中,建立索引(大于某一最小长度的所有间隙的哈希表)可能是一个较好的方法。

However, in practice, building an index (hash list of all gaps greater then some minimum length) might be a better approach.

假设你开始与这些整数(而不是一个位掩码),那么您打造简单地走在序列的指数序列;你觉得比你的门槛,你这个差距尺寸添加到您的字典更大差距的任何时间(实例它作为一个空列表如果有必要,再追加序列中的偏移量。

Assuming that you start with a sequence of these integers (rather than a bitmask) then you build an index by simply walking over the sequence; any time you find a gap greater than your threshold you add that gap size to your dictionary (instantiate it as an empty list if necessary, and then append the offset in the sequence.

在你到底有没有在你的序列中的每个间隙(比所需的阈值)的列表。

At the end you have a list of every gap (greater than your desired threshold) in your sequence.

有关这种方法的一个好处是,你应该能够为您修改基本列表保持这个指数。因此,为O(n *日志(n))的初始花费的时间构建索引为O摊销(日志(n))的成本进行后续的查询和更新索引。

One nice thing about this approach is that you should be able to maintain this index as you modify the base list. So the O(n*log(n)) initial time spent building the index is amortized by O(log(n)) cost for subsequent queries and updates to the indexes.

下面是建立一个非常粗略的功能 gap_index()

Here's a very crude function to build the gap_index():

def gap_idx(s, thresh=2):
    ret = dict()
    lw = s[0]  # initial low val.
    for z,i in enumerate(s[1:]):
        if i - lw < thresh:
            lw = i
            continue
        key = i - lw
        if key not in ret:
            ret[key] = list()
        ret[key].append(z)
        lw = i
    return ret

一个班同时维护一个数据集和索引可能最好的内置的对开模块及其 insort()函数左右建成。

A class to maintain both a data set and the index might best be built around the built-in 'bisect' module and its insort() function.

这篇关于寻找具有位掩码数据差距的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆