寻找最小长度RLE [英] Finding the minimum length RLE

查看:100
本文介绍了寻找最小长度RLE的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

经典的RLE算法通过使用数字来表示代表数字的字符出现在文本中该位置的次数来压缩数据。例如:



AAABBAAABBCECE => 3A2B3A2B1C1E1C1E



但是,在上面的示例中,该方法导致的结果更多压缩文本正在使用的空间。更好的主意是使用数字来表示数字后 substring 在给定文本中出现了多少次。例如:



AAABBAAABBCECE => 2AAABB2CE(两次 AAABB,然后两次 CE)。



现在,我的问题是:如何使用这种方法实现一种有效的算法,以找出最佳RLE中的最少字符数?存在蛮力方法,但我需要更快的方法(最多 O(length 2 )。也许我们可以使用动态编程?

解决方案

可以在二次方 cubic 通过动态编程实现二次时间。



以下是一些Python代码:

  import sys 
import numpy as np

bignum = 10000

S = sys.argv [1]#'AAABBAAABBCECE'
N = len(S)

#最长子串匹配长度s [i:]和s [j:]
maxmatch = np.zeros((N + 1,N + 1) ,dtype = int)

对于xrange(i-1,-1)中的i:
对于j在xrange(i + 1,N)中:
如果S [i] == S [j]:
maxmatch [i,j] = maxmatch [i + 1,j + 1] +1

#P [n,k] =给定最后一个k块,编码前n个字符的成本
P = np.zeros((N + 1,N + 1),dtype = int)+ bignum
#Q [n] =编码前n个字符的成本
Q = np。零(N + 1,dtype = int)+ bignum

#基本情况:空字符串
P [0,0] = 0
Q [0] = 0

对于xrange(1,N + 1)中的n:
对于k in xrange(1,n + 1):
如果n-2 * k> = 0:
#s1,s2 = S [nk:n],S [n-2 * k:nk]
#如果s1 == s2:
如果maxmatch [n-2 * k,nk]> = k:
#这里我们要增加计数:C x_1 ... x_k-> C + 1 x_1 ... x_k
P [n,k] = min(P [n,k],P [nk,k])
print'P [%d,%d] = %d'%(n,k,P [n,k])
#这里我们开始一个新块:1 x_1 ... x_k
P [n,k] = min(P [ n,k],Q [nk] +1 + k)
print'P [%d,%d] =%d'%(n,k,P [n,k])
xrange(1,n + 1)中的k:
Q [n] = min(Q [n],P [n,k])

打印

print Q [N]

您可以通过记住选择的方式来重构实际的编码。 / p>

我没有留下细微的皱纹,这就是说如果C大,我们可能必须使用一个额外的字节来保存C + 1。如果您使用的是32位整数,则在该算法的运行时可行的任何情况下都不会出现这种情况。如果您有时使用较短的int以节省空间,则必须考虑这一点,并可能根据最新C的大小在表中添加另一个维度。理论上,这可能会增加log(N)因子,但是我不认为这在实践中会很明显。



编辑:为了@Moron的好处,这是相同的代码,带有更多打印语句,因此您可以

  import sys 
import numpy as np

bignum = 10000

S = sys.argv [1]#'AAABBAAABBCECE'
N = len(S)

#最长子串匹配长度s [i:]和s [j:]
maxmatch = np.zeros((N + 1,N + 1),dtype = int)

对于xrange(N-1)中的i ,-1,-1):xrange(i + 1,N)中j的
:$ b如果S [i] == S [j],则$ b:
maxmatch [i,j] = maxmatch [i + 1,j + 1] +1

#P [n, k] =假设最后一个k是一个块
,则编码前n个字符的成本P = np.zeros((N + 1,N + 1),dtype = int)+ bignum
#Q [n ] =编码前n个字符的成本
Q = np.zeros(N + 1,dtype = int)+ bignum

#基本情况:空字符串
P没有成本[0,0] = 0
Q [0] = 0

对于xrange(1,N + 1)中的n:
对于k in xrange(1,n + 1):
,如果n-2 * k> = 0:
#s1,s2 = S [nk:n],S [n-2 * k:nk]
#如果s1 == s2:如果maxmatch [n-2 * k,n-k]> = k,则

#这里,我们增加计数:C x_1 ... x_k-> C + 1 x_1 ... x_k
P [n,k] = min(P [n,k],P [nk,k])
print P [%d,%d] = %d\t如果我将我的解决方案用于P [%d,%d],并且%s的计数增加,则我只能将S的前%d个字符编码为%d(n(
,k ,P [n,k],n,P [nk,k],nk,k,S [nk:n])
#这里我们开始一个新的块:1 x_1 ... x_k
P [n,k] = min(P [n,k],Q [nk] +1 + k)
print'P [%d,%d] =%d\t我可以先编码如果我将Q [%d]的解决方案用于新块1%s'%(n,k,P [n,k],n,Q [\
nk] + 1 + k,nk,S [nk:n])
对于xrange(1,n + 1)中的k:
Q [n] = min(Q [n], P [n,k])

print
print'Q [%d] =%d\t我只能以%d个字符编码S的前%d个字符!'% (n,Q [n],n,Q [n])
打印


打印Q [N]

在ABCDABCDABCDBCD上输出的最后几行是这样的:

  Q [13] = 7我只能用7个字符对S的前13个字符进行编码! 

P [14,1] = 9如果我将Q [13]的解决方案用于新的块1C
P [14,则我只能用9个字符对S的前14个字符进行编码,2] = 8如果我将Q [12]的解决方案用于新块1BC
P [14,3] = 13我可以将S的前14个字符编码为8个字符,则我可以对前14个字符进行编码如果我将Q [11]的解决方案与新块1DBC
P [14,4] = 13一起使用,则仅用13个字符来表示S对于带有新块1CDBC
P [14,5] = 13的Q [10],如果我对带有新块1BCDBC $的Q [9]使用解,则我只能以13个字符编码S的前14个字符b $ b P [14,6] = 12如果我将Q [8]的解决方案用于新的块1ABCDBC
P [14,7] = 16,我只能将S的前14个字符编码为12个字符如果我将Q [7]的解决方案与新块1DABCDBC
P [14,8] = 16一起使用,则我只能将S的前14个字符编码为16个字符如果我将Q [6]的解决方案用于新块1CDABCDBC
P [14,9] = 16,则我的S拖拉机仅16个字符,如果我使用新块1BCDABCDBC Q [5]的解
P [14,10] = 16如果我对新块1ABCDABCDBC使用Q [4]的解,则我只能以16个字符编码S的前14个字符
P [14,11] = 16如果我将Q [3]的解决方案与新块1DABCDABCDBC一起使用,则我只能将S的前14个字符编码为16个字符。
P [14,12] = 16如果我将Q [2]的解决方案用于新块1CDABCDABCDBC
P [14,13] = 16我只能将S的前14个字符编码为16个字符,但我只能将S的前14个字符编码为16个字符。如果我将解决方案Q [1]用于新块1BCDABCDABCDBC
P [14,14] = 15则我可以对16个字符进行编码如果我对Q [0]使用解决方案,我只能在15个字符中对S的前14个字符进行编码]与一个新块1ABCDABCDABCDBC

Q [14] = 8我可以编码第一个14 cha S字符仅8个字符!

P [15,1] = 10如果我将Q [14]的解决方案用于新的块1D
P [15,则只能用10个字符对S的前15个字符进行编码,2] = 10如果我将Q [13]的解决方案用于新块1CD
P [15,3] = 11我可以将S的前15个字符编码为10个字符,我可以对前15个字符进行编码如果我将我的解决方案用于P [12,3]且BCD的计数增加了
P [15,3] = 9,则我的S只能用11个字符编码如果我使用我的我的S只能用9个字符编码S的前15个字符Q [12]的新块1BCD的解
P [15,4] = 14如果我将Q [11]的新块1DBCD用于解,则我只能以14个字符编码S的前15个字符
P [15,5] = 14如果我将Q [10]的解决方案与新块一起使用,则我只能将S的前15个字符编码为14个字符1CDBCD
P [15,6] = 14如果我将Q [9]的解决方案用于新块1BCDBCD
P [15,7] = 13,我可以将S的前15个字符编码为仅14个字符,我可以编码fir t如果我将Q [8]的解决方案用于新块1ABCDBCD
P [15,8] = 17我只能将S的前15个字符编码为17个字符,但我只能用13个字符将我的解决方案用于Q [7]并带有一个新块1DABCDBCD
P [15,9] = 17如果我将我的解决方案用于Q [6]并带有一个新的块,则我只能将S的前15个字符编码为17个字符块1CDABCDBCD
P [15,10] = 17如果我对新的块1BCDABCDBCD
P [15,11]使用Q [5]的解,则我只能以17个字符编码S的前15个字符] = 17如果我将Q [4]的解决方案与新块1ABCDABCDBCD
P [15,12] = 17我可以对S的前15个字符进行编码,则我只能对S的前15个字符进行编码如果我将解决方案用于Q [3]并带有一个新块1DABCDABCDBCD
P [15,13] = 17仅使用17个字符,如果我将解决方案用于Q,我只能用17个字符编码S的前15个字符[2]用一个新块1CDABCDABCDBCD
P [15,14] = 17我可以编码前15个字符如果我将Q [1]的解决方案用于新的块1BCDABCDABCDBCD
P [15,15] = 16,则我的S rs仅17个字符,如果我使用S的前15个字符仅16个字符Q [0]的新块解决方案1ABCDABCDABCDBCD

Q [15] = 9我只能用9个字符对S的前15个字符进行编码!


The classical RLE algorithm compresses data by using numbers to represent how many times the character following a number appears in the text at that position. For example:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E

However, in the above example, that method results in even more space being used by the compressed text. A better idea would be to use numbers to represent how many times the substring following a number appears in the given text. For example:

AAABBAAABBCECE => 2AAABB2CE ("AAABB" twice, then "CE" twice).

Now, my question is: how could I implement an efficient algorithm that finds out the minimum number of characters in an optimal RLE using this method? Brute force methods exist, but I need something faster (at most O(length2)). Perhaps we can use dynamic programming?

解决方案

It can be done in quadratic cubic quadratic time via dynamic programming.

Here is some Python code:

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print 'P[%d,%d] = %d' % (n,k,P[n,k])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d' % (n,k,P[n,k])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print

print Q[N]

You can reconstruct the actual encoding by remembering your choices along the way.

I have left out a minor wrinkle, which is that we might have to use an extra byte to hold C+1 if C is large. If you are using 32-bit ints, this will not come up in any context where this algorithm's runtime is feasible. If you are sometimes using shorter ints to save space then you will have to think about it, and maybe add another dimension to your table based on the size of the latest C. In theory, this might add a log(N) factor, but I don't think this will be apparent in practice.

Edit: For the benefit of @Moron, here is the same code with more print statements, so that you can more easily see what the algorithm is thinking:

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print "P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for P[%d,%d] with %s's count incremented" % (n\
,k,P[n,k],n,P[n-k,k],n-k,k,S[n-k:n])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for Q[%d] with a new block 1%s' % (n,k,P[n,k],n,Q[\
n-k]+1+k,n-k,S[n-k:n])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print
  print 'Q[%d] = %d\t I can encode first %d characters of S in only %d characters!' % (n,Q[n],n,Q[n])
  print


print Q[N]

The last few lines of its output on ABCDABCDABCDBCD are like so:

Q[13] = 7        I can encode first 13 characters of S in only 7 characters!

P[14,1] = 9      I can encode first 14 characters of S in only 9 characters if I use my solution for Q[13] with a new block 1C
P[14,2] = 8      I can encode first 14 characters of S in only 8 characters if I use my solution for Q[12] with a new block 1BC
P[14,3] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[11] with a new block 1DBC
P[14,4] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[10] with a new block 1CDBC
P[14,5] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[9] with a new block 1BCDBC
P[14,6] = 12     I can encode first 14 characters of S in only 12 characters if I use my solution for Q[8] with a new block 1ABCDBC
P[14,7] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[7] with a new block 1DABCDBC
P[14,8] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[6] with a new block 1CDABCDBC
P[14,9] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[5] with a new block 1BCDABCDBC
P[14,10] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[4] with a new block 1ABCDABCDBC
P[14,11] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[3] with a new block 1DABCDABCDBC
P[14,12] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBC
P[14,13] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBC
P[14,14] = 15    I can encode first 14 characters of S in only 15 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBC

Q[14] = 8        I can encode first 14 characters of S in only 8 characters!

P[15,1] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[14] with a new block 1D
P[15,2] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[13] with a new block 1CD
P[15,3] = 11     I can encode first 15 characters of S in only 11 characters if I use my solution for P[12,3] with BCD's count incremented
P[15,3] = 9      I can encode first 15 characters of S in only 9 characters if I use my solution for Q[12] with a new block 1BCD
P[15,4] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[11] with a new block 1DBCD
P[15,5] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[10] with a new block 1CDBCD
P[15,6] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[9] with a new block 1BCDBCD
P[15,7] = 13     I can encode first 15 characters of S in only 13 characters if I use my solution for Q[8] with a new block 1ABCDBCD
P[15,8] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[7] with a new block 1DABCDBCD
P[15,9] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[6] with a new block 1CDABCDBCD
P[15,10] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[5] with a new block 1BCDABCDBCD
P[15,11] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[4] with a new block 1ABCDABCDBCD
P[15,12] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[3] with a new block 1DABCDABCDBCD
P[15,13] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBCD
P[15,14] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBCD
P[15,15] = 16    I can encode first 15 characters of S in only 16 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBCD

Q[15] = 9        I can encode first 15 characters of S in only 9 characters!

这篇关于寻找最小长度RLE的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆