最长等间距的子序列 [英] Longest equally-spaced subsequence

查看:208
本文介绍了最长等间距的子序列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一百万整数有序,我想找到最长的序列,其中连续对之间的差别是相等的。例如

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example

1, 4, 5, 7, 8, 12

有一个子

   4,       8, 12

我的幼稚的方法是贪婪的,只是检查多远,你可以从每一个点扩展序列。这需要每点似乎 O(N²)的时间。

有没有更快的方法来解决这个问题?

Is there a faster way to solve this problem?

更新。我会测试,尽快给出的答案尽可能code(谢谢)。但是很显然已经在使用N ^ 2的内存将无法正常工作。到目前为止,没有任何code,与输入终止为 [中的xrange random.randint(0,100000)的R(200000)]

Update. I will test the code given in the answers as soon as possible (thank you). However it is clear already that using n^2 memory will not work. So far there is no code that terminates with the input as [random.randint(0,100000) for r in xrange(200000)] .

时序。我与我的32位系统下的输入数据进行测试。

Timings. I tested with the following input data on my 32 bit system.

a= [random.randint(0,10000) for r in xrange(20000)] 
a.sort()

  • ZelluX的动态规划方法使用的RAM和1.6G需要2分14秒。随着pypy花费仅为9秒!然而,它崩溃与大投入内存错误。
  • 阿明的O(ND)时间的方法耗时9秒pypy但内存只有20MB。当然,这将是非常糟糕的,如果范围是大得多。低内存占用意味着我可以用一个= [random.randint(0,100000)中的xrange R(200000)也考验,但它并没有在几分钟内完成我把它与pypy。
  • 在为了能够测试Kluev的第I重新运行该方法

    In order to be able to test the method of Kluev's I reran with

    a= [random.randint(0,40000) for r in xrange(28000)] 
    a = list(set(a))
    a.sort()
    

    做长的名单约 20000 。所有的时序与pypy

    to make a list of length roughly 20000. All timings with pypy

    • ZelluX,9秒
    • Kluev 20秒
    • 阿明52秒

    看来,如果ZelluX方法可以做线性空间,这将是明显的赢家。

    It seems that if the ZelluX method could be made linear space it would be the clear winner.

    推荐答案

    更新:首先这里描述过时通过的阿明Rigo旅馆的第二个答案,这是更简单,更高效。但是,这两种方法有一个缺点。它们需要很多时间,找到结果一百万整数。所以,我想两个变种(见这个答案的后半部分),其中假设输入整数的范围是有限的。这样的限制,允许更快的算法。我也试着优化阿明Rigo旅馆的code。见我的基准测试结果在最后。

    Update: First algorithm described here is obsoleted by Armin Rigo's second answer, which is much simpler and more efficient. But both these methods have one disadvantage. They need many hours to find the result for one million integers. So I tried two more variants (see second half of this answer) where the range of input integers is assumed to be limited. Such limitation allows much faster algorithms. Also I tried to optimize Armin Rigo's code. See my benchmarking results at the end.

    下面是算法使用O(N)的内存的想法。时间复杂度为O(N 2 日志N),但可能会降低到O(N 2 )。

    Here is an idea of algorithm using O(N) memory. Time complexity is O(N2 log N), but may be decreased to O(N2).

    算法使用以下的数据结构:

    Algorithm uses the following data structures:

    1. $ P $光伏:指标指向previous(可能不完全)子元素的数组
    2. :与HashMap的连续对的序列和值=另外两个包含HashMap的键=差。对于这些包含HashMap:键=开始/结束序列,值=一双指数(序列长度,结束的子序列/开始索引)
    3. PQ :优先级队列为所有可能的差异化的值存储在 $ P $光伏和子序列
    1. prev: array of indexes pointing to previous element of (possibly incomplete) subsequence.
    2. hash: hashmap with key = difference between consecutive pairs in subsequence and value = two other hashmaps. For these other hashmaps: key = starting/ending index of the subsequence, value = pair of (subsequence length, ending/starting index of the subsequence).
    3. pq: priority queue for all possible "difference" values for subsequences stored in prev and hash.

    算法:

    1. 初始化 $ P $光伏与指数 I-1 。更新 PQ 注册这一步找到的所有(不完全)的子序列和他们的差异。
    2. 获得(和删除)最小的差异从 PQ 。获取相应的记录从和扫描的二级哈希映射之一。在这次与给定的差异化的所有子序列都是完整的。如果二级哈希映射包含子长度比找到更好的,到目前为止,更新的最好成绩。
    3. 在阵列 $ P $光伏:任何顺序上的步骤#2,减指数和更新中的每个元素并有可能 PQ 。虽然更新,我们可以执行下列操作之一:增加长度为1的新序列,或由1种一些现有的序列,或合并现有的两个子序列
    4. 在卸下第2步中哈希表的记录。
    5. 从第2步继续,而 PQ 不为空。
    1. Initialize prev with indexes i-1. Update hash and pq to register all (incomplete) subsequences found on this step and their "differences".
    2. Get (and remove) smallest "difference" from pq. Get corresponding record from hash and scan one of second-level hash maps. At this time all subsequences with given "difference" are complete. If second-level hash map contains subsequence length better than found so far, update the best result.
    3. In the array prev: for each element of any sequence found on step #2, decrement index and update hash and possibly pq. While updating hash, we could perform one of the following operations: add a new subsequence of length 1, or grow some existing subsequence by 1, or merge two existing subsequences.
    4. Remove hash map record found on step #2.
    5. Continue from step #2 while pq is not empty.

    此算法的更新O(N)的 $ P $光伏 O(N),每个时代的元素。而这些更新可能需要增加一个新的差异化,以 PQ 。这一切都意味着时间复杂度是O(N 2 日志N),如果我们用简单的堆实现 PQ 。要降低到O(N 2 ),我们可能会使用更高级的优先级队列的实现。一些可能性此页上列出:优先级队列

    This algorithm updates O(N) elements of prev O(N) times each. And each of these updates may require to add a new "difference" to pq. All this means time complexity of O(N2 log N) if we use simple heap implementation for pq. To decrease it to O(N2) we might use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues.

    查看 Ideone 相应的Python code。这code不允许在列表中重复的元素。它可以解决这个问题,但是这将是一个很好的优化反正删除重复(并分别找到超越重复最长的序列)。

    See corresponding Python code on Ideone. This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond duplicates separately).

    一些优化后的同一code。在这里搜索尽快终止序列的长度乘以可能子差异化超出源列表的范围。

    And the same code after a little optimization. Here search is terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range.

    阿明Rigo旅馆的code是简单,pretty的效率。但在某些情况下,它确实可避免一些额外的计算。搜索可以尽快终止序列的长度乘以可能子差异化超出源列表范围:

    Armin Rigo's code is simple and pretty efficient. But in some cases it does some extra computations that may be avoided. Search may be terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range:

    def findLESS(A):
      Aset = set(A)
      lmax = 2
      d = 1
      minStep = 0
    
      while (lmax - 1) * minStep <= A[-1] - A[0]:
        minStep = A[-1] - A[0] + 1
        for j, b in enumerate(A):
          if j+d < len(A):
            a = A[j+d]
            step = a - b
            minStep = min(minStep, step)
            if a + step in Aset and b - step not in Aset:
              c = a + step
              count = 3
              while c + step in Aset:
                c += step
                count += 1
              if count > lmax:
                lmax = count
        d += 1
    
      return lmax
    
    print(findLESS([1, 4, 5, 7, 8, 12]))
    


    如果在源数据的整数(M)的范围小,一个简单的算法可以为O(M 2 )的时间和O(M)空间:


    If range of integers in source data (M) is small, a simple algorithm is possible with O(M2) time and O(M) space:

    def findLESS(src):
      r = [False for i in range(src[-1]+1)]
      for x in src:
        r[x] = True
    
      d = 1
      best = 1
    
      while best * d < len(r):
        for s in range(d):
          l = 0
    
          for i in range(s, len(r), d):
            if r[i]:
              l += 1
              best = max(best, l)
            else:
              l = 0
    
        d += 1
    
      return best
    
    
    print(findLESS([1, 4, 5, 7, 8, 12]))
    

    有类似于第一方法由阿明Rigo旅馆,但它不使用任何动态数据结构。我想,源数据没有重复。而(保持code简单的)我还假设最小输入值不为负,接近零。

    It is similar to the first method by Armin Rigo, but it doesn't use any dynamic data structures. I suppose source data has no duplicates. And (to keep the code simple) I also suppose that minimum input value is non-negative and close to zero.

    previous算法可以得到改进。下面所示的code实现位集合作为一个内置Python整数。它具有相同的假设:没有重复,最小输入值是非负的和接近零。时间复杂度为O(M 2 *登录L),其中L是最佳的子序列的长度,空间复杂度为O(M)

    Previous algorithm may be improved if instead of the array of booleans we use a bitset data structure and bitwise operations to process data in parallel. The code shown below implements bitset as a built-in Python integer. It has the same assumptions: no duplicates, minimum input value is non-negative and close to zero. Time complexity is O(M2 * log L) where L is the length of optimal subsequence, space complexity is O(M):

    def findLESS(src):
      r = 0
      for x in src:
        r |= 1 << x
    
      d = 1
      best = 1
    
      while best * d < src[-1] + 1:
        c = best
        rr = r
    
        while c & (c-1):
          cc = c & -c
          rr &= rr >> (cc * d)
          c &= c-1
    
        while c != 1:
          c = c >> 1
          rr &= rr >> (c * d)
    
        rr &= rr >> d
    
        while rr:
          rr &= rr >> d
          best += 1
    
        d += 1
    
      return best
    


    基准:

    输入数据(约100000的整数)的生成是这样的:

    Input data (about 100000 integers) is generated this way:

    random.seed(42)
    s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))
    

    和实现最快速的算法,我还使用了下面的数据(100多万的整数):

    And for fastest algorithms I also used the following data (about 1000000 integers):

    s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)])))
    

    所有的结果显示时间(秒):

    All results show time in seconds:

    Size:                         100000   1000000
    Second answer by Armin Rigo:     634         ?
    By Armin Rigo, optimized:         64     >5000
    O(M^2) algorithm:                 53      2940
    O(M^2*L) algorithm:                7       711
    

    这篇关于最长等间距的子序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆