In-Place Run-Length Encoding Algorithm


Problem Description


I encountered an interview question:

Given an input string: aaaaabcddddee, convert it to a5b1c1d4e2.

One extra constraint is that this needs to be done in place, meaning no extra space (such as an array) should be used.

It is guaranteed that the encoded string will always fit in the original string. In other words, string like abcde will not occur, since it will be encoded to a1b1c1d1e1 which occupies more space than the original string.

One hint interviewer gave me was to traverse the string once and find the space that is saved.

Still, I am stuck, because sometimes, without using extra variables, some values in the input string may be overwritten.

Any suggestions would be appreciated.

Solution

This is a good interview question.

Key Points

There are 2 key points:

  1. A single character must be encoded as c1;
  2. The encoded string will never be longer than the original array.

From point 1, we know each run of characters needs at least 2 positions to be encoded. That is to say, only a single character needs more space once it is encoded.

Simple Approach

From the key points, we notice that single characters cause us a lot of problems during encoding, because they might not have enough room to hold the encoded string. So how about we leave them alone at first, and compress the other characters first?

For example, if we encode aaaaabcddddee from the back while leaving the single characters alone at first, we will get:

aaaaabcddddee
_____a5bcd4e2

Then we could safely start from the beginning and encode the partly encoded sequence; key point 2 guarantees that there will be enough space.

Analysis

Seems like we've got a solution, are we done? No. Consider this string:

aaa3dd11ee4ff666

The problem doesn't limit the range of characters, so the input could contain digits as well. In this case, if we still use the same approach, we will get this:

aaa3dd11ee4ff666
__a33d212e24f263

Ok, now tell me, how do you distinguish the run-length from those numbers in the original string?

Well, we need to try something else.

Let's define Encode Benefit (E) as the length difference between the original run of consecutive characters and its encoded form.

For example, aa has E = 0, since aa will be encoded to a2, and they have no length difference; aaa has E = 1, since it will be encoded as a3, and the length difference between the encoded and the original is 1. Let's look at the single character case, what's its E? Yes, it's -1. From the definition, we could deduce the formula for E: E = ori_len - encoded_len.

Now let's go back to the problem. From key point 2, we know the encoded string will never be longer than the original one. How do we use E to rephrase this key point?

Very simple: sigma(E_i) >= 0, where E_i is the Encode Benefit of the i-th run of consecutive characters.

For example, the sample you gave in your problem: aaaaabcddddee, can be broken down into 5 parts:

E(0) = 5 - 2 = 3  // aaaaa -> a5
E(1) = 1 - 2 = -1 // b -> b1
E(2) = 1 - 2 = -1 // c -> c1
E(3) = 4 - 2 = 2  // dddd -> d4
E(4) = 2 - 2 = 0  // ee -> e2

And the sigma will be: 3 + (-1) + (-1) + 2 + 0 = 3 > 0. This means there will be 3 spaces left after encoding.
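As a quick check of the breakdown above, here is a small Python sketch that computes E for every run. The helper name encode_benefits is made up for illustration, and the encoded length is assumed to be one character plus the decimal digits of the count (the examples above only cover single-digit counts). It builds a list, so it is not in place; it only illustrates the arithmetic.

```python
from itertools import groupby


def encode_benefits(s):
    """Return (char, run_length, E) for each run, with E = run_length - encoded_length."""
    result = []
    for ch, run in groupby(s):
        n = len(list(run))
        encoded_len = 1 + len(str(n))  # assumption: one char plus the count's digits
        result.append((ch, n, n - encoded_len))
    return result


benefits = encode_benefits("aaaaabcddddee")
for ch, n, e in benefits:
    print(f"{ch * n} -> {ch}{n}: E = {e}")
print("total:", sum(e for _, _, e in benefits))
```

For the sample string this gives the per-run values 3, -1, -1, 2, 0, and a total of 3.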

However, from this example, we can see a potential problem: since we are summing, even if the final total is non-negative, the running total may go negative somewhere in the middle!

Yes, this is a problem, and it's quite serious. If the running total of E falls below 0, we do not have enough space to encode the current character and will overwrite some characters after it.

But why do we need to sum from the first group? Why can't we start summing from somewhere in the middle to skip the negative part? Let's look at an example:

2 0 -1 -1 -1 1 3 -1

If we sum from the beginning, we fall below 0 after adding the third -1 at index 4 (0-based); if we start summing from index 5 and loop back to index 0 when we reach the end, we have no problem.
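The claim about this example can be checked with a short sketch. The function first_negative_index is a hypothetical helper, not part of the original answer: it walks the list of E values circularly from a given start and reports where the running total first goes negative.

```python
def first_negative_index(benefits, start):
    """Walk `benefits` circularly from `start`.

    Return the index at which the running total first drops below 0,
    or None if it stays non-negative for a full lap.
    """
    total = 0
    n = len(benefits)
    for step in range(n):
        idx = (start + step) % n
        total += benefits[idx]
        if total < 0:
            return idx
    return None


E = [2, 0, -1, -1, -1, 1, 3, -1]
print(first_negative_index(E, 0))  # 4
print(first_negative_index(E, 5))  # None
```

Starting at index 0 the total goes 2, 2, 1, 0, -1, so it fails at index 4; starting at index 5 it never drops below 0.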

Algorithm

The analysis gives us an insight on the algorithm:

  1. Start from the beginning, calculate E of the current consecutive group, and add it to the running total E_total;
  2. If E_total is still non-negative (>= 0), we are fine and can safely proceed to the next group;
  3. If E_total falls below 0, we need to start over from the next group, i.e. clear E_total and record the position right after the current group as the new starting point.

If we reach the end of the sequence and E_total is still non-negative, the last starting point is a good start! This step takes O(n) time. Normally we would need to loop back and check again, but because of key point 2 a valid answer is guaranteed to exist, so we can safely stop here.
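The three steps can be sketched in Python as a single pass that uses only a few integer variables. The name find_start is an illustrative choice, and the input is assumed to satisfy the guarantee that the encoded string fits in the original; the encoded length is again assumed to be one character plus the digits of the count.

```python
def find_start(s):
    """Return the index of the run from which circular encoding never runs short.

    Assumes the total Encode Benefit of `s` is non-negative.
    """
    n = len(s)
    i = 0
    e_total = 0
    pos = 0
    while i < n:
        # advance j to the end of the run that starts at i
        j = i
        while j < n and s[j] == s[i]:
            j += 1
        run_len = j - i
        e_total += run_len - (1 + len(str(run_len)))
        if e_total < 0:
            # not enough room so far: restart right after this run
            e_total = 0
            pos = j
        i = j
    return pos


print(find_start("aaaaabcddddee"))        # 0
print(find_start("abccdddefggggghhhhh"))  # 9
```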

Then we go back to the starting point and do traditional run-length encoding; after we reach the end, we go back to the beginning of the sequence to finish the first part. The tricky part is that we need to make use of the remaining space at the end of the string. After that, we need to do some shifting in case the encoded pieces are out of order, and remove any extra white space, then we are finally done :)

Therefore, we have a solution (the code is only pseudocode and hasn't been verified):

// find the starting position first
i = j = E_total = pos = 0;
while (i < s.length) {
    // advance j to the end of the current run (stop at the end of the string)
    while (j < s.length && s[i] == s[j]) j++;
    E_total += calculate_encode_benefit(i, j);
    if (E_total < 0) {
        // not enough room so far: restart right after this run
        E_total = 0;
        pos = j;
    }
    i = j;
}

// do run length encoding as usual:
// start from pos, end with s.length - 1, the first available place is pos
int last_available_pos = runlength(s, pos, s.length - 1, pos);
// a tricky part here is to make use of the remaining spaces from the end!!!
// (the output wraps around to index 0 once the end of the string is reached)
int fin_pos = runlength(s, 0, pos - 1, last_available_pos);
// eliminate the white space between fin_pos and pos
eliminate(s, fin_pos, pos);
// update last_available_pos because of elimination
last_available_pos -= (pos - fin_pos < 0) ? 0 : (pos - fin_pos);
// rotate back
rotate(s, last_available_pos);
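For comparison, here is a runnable Python sketch of the same idea. It is one possible reading of the pseudocode, not a verified transcription of it. Assumptions: the input is a list of characters (Python strings are immutable), the encoded result is guaranteed to fit, and all function names are illustrative. Instead of a separate white space elimination step, this sketch treats the buffer as circular and uses two rotations followed by a truncation.

```python
def find_start(s):
    """Greedy scan: restart after any run that drives the running total below 0."""
    i = e_total = pos = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        e_total += (j - i) - (1 + len(str(j - i)))  # E of this run
        if e_total < 0:
            e_total, pos = 0, j
        i = j
    return pos


def encode_range(s, begin, end, write):
    """Encode the runs of s[begin:end], writing circularly from index `write`.

    Returns the number of cells written.
    """
    n, written, i = len(s), 0, begin
    while i < end:
        j = i
        while j < end and s[j] == s[i]:
            j += 1
        for c in s[i] + str(j - i):  # the character, then its count digits
            s[(write + written) % n] = c
            written += 1
        i = j
    return written


def reverse(s, lo, hi):
    """Reverse s[lo:hi] in place."""
    hi -= 1
    while lo < hi:
        s[lo], s[hi] = s[hi], s[lo]
        lo, hi = lo + 1, hi - 1


def rotate_left(s, length, k):
    """Rotate s[0:length] left by k positions using three reversals."""
    reverse(s, 0, k)
    reverse(s, k, length)
    reverse(s, 0, length)


def rle_in_place(s):
    """Run-length encode the list of characters `s` in place."""
    n = len(s)
    pos = find_start(s)
    suffix_len = encode_range(s, pos, n, pos)               # runs from pos to the end
    prefix_len = encode_range(s, 0, pos, pos + suffix_len)  # runs before pos, wrapping
    rotate_left(s, n, pos)                                  # output now starts at index 0
    rotate_left(s, suffix_len + prefix_len, suffix_len)     # put the prefix part first
    del s[suffix_len + prefix_len:]                         # drop the unused tail


chars = list("abccdddefggggghhhhh")
rle_in_place(chars)
print("".join(chars))  # a1b1c2d3e1f1g5h5
```

The small temporary string built for each run's count is a convenience of the sketch; a stricter version would write the digits directly.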

Complexity

We have 4 parts in the algorithm:

  1. Find the starting place: O(n)
  2. Run-Length-Encoding on the whole string: O(n)
  3. White space elimination: O(n)
  4. In place string rotation: O(n)

Therefore we have O(n) in total.

Visualization

Suppose we need to encode this string: abccdddefggggghhhhh

First step, we need to find the starting position:

Group 1: a     -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 1;
Group 2: b     -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 2;
Group 3: cc    -> E_total += 0  -> E_total = 0 >= 0 -> proceed;
Group 4: ddd   -> E_total += 1  -> E_total = 1 >= 0 -> proceed;
Group 5: e     -> E_total += -1 -> E_total = 0 >= 0 -> proceed;
Group 6: f     -> E_total += -1 -> E_total = -1 < 0 -> E_total = 0, pos = 9;
Group 7: ggggg -> E_total += 3  -> E_total = 3 >= 0 -> proceed;
Group 8: hhhhh -> E_total += 3  -> E_total = 6 >= 0 -> end;

So the start position will be 9:

         v this is the starting point
abccdddefggggghhhhh
abccdddefg5h5______
             ^ last_available_pos, we need to make use of these remaining spaces
abccdddefg5h5a1b1c2
d3e1f1___g5h5a1b1c2
      ^^^ remove the white space
d3e1f1g5h5a1b1c2
          ^ last_available_pos, rotate
a1b1c2d3e1f1g5h5
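The last two steps of this trace (closing the gap, then rotating) can be reproduced with a short Python sketch. The helper names eliminate and rotate mirror the pseudocode, but their exact signatures here are assumptions; the buffer is a list of characters, and the gap bounds 6 and 9 and the rotation point 13 - 3 = 10 are read off the trace above.

```python
def eliminate(s, gap_start, gap_end):
    """Close the gap s[gap_start:gap_end] by shifting the tail left."""
    shift = gap_end - gap_start
    for i in range(gap_end, len(s)):
        s[i - shift] = s[i]
    del s[len(s) - shift:]  # drop the now-duplicated cells at the end


def reverse(s, lo, hi):
    """Reverse s[lo:hi] in place."""
    hi -= 1
    while lo < hi:
        s[lo], s[hi] = s[hi], s[lo]
        lo, hi = lo + 1, hi - 1


def rotate(s, k):
    """Move s[k:] in front of s[:k] using three reversals."""
    reverse(s, 0, k)
    reverse(s, k, len(s))
    reverse(s, 0, len(s))


state = list("d3e1f1___g5h5a1b1c2")
eliminate(state, 6, 9)      # remove the three blank cells
print("".join(state))       # d3e1f1g5h5a1b1c2
rotate(state, 13 - 3)       # last_available_pos shifted left by the gap width
print("".join(state))       # a1b1c2d3e1f1g5h5
```

Both helpers touch each cell a constant number of times, which matches the O(n) bounds listed under Complexity.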

Last Words

This question is not trivial; it actually glues several traditional coding interview questions together naturally. A suggested thought process would be:

  1. observe the pattern and figure out the key points;
  2. realize that the space runs short because of encoding single characters;
  3. quantify the benefit/cost of encoding each group of consecutive characters (a.k.a. Encode Benefit);
  4. use the quantity you proposed to restate the original guarantee;
  5. figure out the algorithm to find a good starting point;
  6. figure out how to do run-length-encoding with a good starting point;
  7. realize you need to rotate the encoded string and eliminate the white spaces;
  8. figure out the algorithm to do in place string rotation;
  9. figure out the algorithm to do in place white space elimination.

To be honest, it's a bit challenging for an interviewee to come up with a solid algorithm in a short time, so your analysis really matters. Don't stay silent; show your thought process, as this helps the interviewer find out what stage you are at.
