无损分级游程编码 [英] Lossless hierarchical run length encoding

查看:123
本文介绍了无损分级游程编码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

欲总结而不是玉米preSS以相似的方式来运行长度编码,但在嵌套感。

I want to summarize rather than compress in a similar manner to run length encoding but in a nested sense.

例如,我想:ABCBCABCBCDEEF成为:(2A(2BC))D(2E)F

For instance, I want : ABCBCABCBCDEEF to become: (2A(2BC))D(2E)F

我不关注,一个选项被两个相同的可能嵌套例如之间挑

I am not concerned that an option is picked between two identical possible nestings E.g.

ABBABBABBABA可以是(3ABB)ABA或甲(3BBA)BA彼此相同的COM pressed长度的,尽管有不同的结构。

ABBABBABBABA could be (3ABB)ABA or A(3BBA)BA which are of the same compressed length, despite having different structures.

不过,我想选择是最贪婪的。例如:

However I do want the choice to be MOST greedy. For instance:

ABCDABCDCDCDCD会选(2ABCD)(3CD) - 长度6在原始符号小于ABCDAB(4CD),它是长度为8在原始符号的

ABCDABCDCDCDCD would pick (2ABCD)(3CD) - of length six in original symbols which is less than ABCDAB(4CD) which is length 8 in original symbols.

在背景方面我有,我想总结了一些重复的模式。使得数据更易消化。我不想破坏数据的逻辑顺序,因为它是非常重要的。但我想总结一下,说,符号A 3次出现,其次是符号XYZ 20事件等,这可以直观地显示在一个嵌套的感觉。

In terms of background I have some repeating patterns that I want to summarize. So that the data is more digestible. I don't want to disrupt the logical order of the data as it is important. but I do want to summarize it , by saying, symbol A times 3 occurrences, followed by symbols XYZ for 20 occurrences etc. and this can be displayed in a nested sense visually.

欢迎的想法。

推荐答案

我是pretty的肯定这是不是最好的方法,并根据模式的长度,可能有一个运行时间和内存使用情况这是行不通的,但这里的一些code。

I'm pretty sure this isn't the best approach, and depending on the length of the patterns, might have a running time and memory usage that won't work, but here's some code.

您可以粘贴以下code到 LINQPad 并运行它,它应该产生下面的输出:

You can paste the following code into LINQPad and run it, and it should produce the following output:


ABCBCABCBCDEEF = (2A(2BC))D(2E)F
ABBABBABBABA = (3A(2B))ABA
ABCDABCDCDCDCD = (2ABCD)(3CD)

正如你所看到的,中间的例子EN codeD ABB A(2B)而不是 ABB ,你就必须自己做出这样的判断,如果那样的单一符号序列应该是连接codeD作为一个重复的符号或没有,或如果一个特定的阈值(如3个或更多),应使用

As you can see, the middle example encoded ABB as A(2B) instead of ABB, you would have to make that judgment yourself, if single-symbol sequences like that should be encoded as a repeated symbol or not, or if a specific threshold (like 3 or more) should be used.

基本上,code运行是这样的:

Basically, the code runs like this:

  1. 为序列中的每一个位置,试图找到最长的匹配(实际上,它没有,它采取的第一个2 +匹配它发现,我离开其余作为练习你,因为我不得不离开我的电脑几个小时了)
  2. 然后,它会尝试连接code的顺序,一个一个重复,递归和吐出对象的X *序列类型
  3. 如果它不能找到一个重复序列,它吐出来的是一个符号在该位置
  4. 然后,它会跳过它连接codeD,并从#1
  5. 继续

总之,这里的code:

Anyway, here's the code:

void Main()
{
    string[] examples = new[]
    {
        "ABCBCABCBCDEEF",
        "ABBABBABBABA",
        "ABCDABCDCDCDCD",
    };

    foreach (string example in examples)
    {
        StringBuilder sb = new StringBuilder();
        foreach (var r in Encode(example))
            sb.Append(r.ToString());
        Debug.WriteLine(example + " = " + sb.ToString());
    }
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values)
{
    return Encode<T>(values, EqualityComparer<T>.Default);
}

public static IEnumerable<Repeat<T>> Encode<T>(IEnumerable<T> values, IEqualityComparer<T> comparer)
{
    List<T> sequence = new List<T>(values);

    int index = 0;
    while (index < sequence.Count)
    {
        var bestSequence = FindBestSequence<T>(sequence, index, comparer);
        if (bestSequence == null || bestSequence.Length < 1)
            throw new InvalidOperationException("Unable to find sequence at position " + index);

        yield return bestSequence;
        index += bestSequence.Length;
    }
}

private static Repeat<T> FindBestSequence<T>(IList<T> sequence, int startIndex, IEqualityComparer<T> comparer)
{
    int sequenceLength = 1;
    while (startIndex + sequenceLength * 2 <= sequence.Count)
    {
        if (comparer.Equals(sequence[startIndex], sequence[startIndex + sequenceLength]))
        {
            bool atLeast2Repeats = true;
            for (int index = 0; index < sequenceLength; index++)
            {
                if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength + index]))
                {
                    atLeast2Repeats = false;
                    break;
                }
            }
            if (atLeast2Repeats)
            {
                int count = 2;
                while (startIndex + sequenceLength * (count + 1) <= sequence.Count)
                {
                    bool anotherRepeat = true;
                    for (int index = 0; index < sequenceLength; index++)
                    {
                        if (!comparer.Equals(sequence[startIndex + index], sequence[startIndex + sequenceLength * count + index]))
                        {
                            anotherRepeat = false;
                            break;
                        }
                    }
                    if (anotherRepeat)
                        count++;
                    else
                        break;
                }

                List<T> oneSequence = Enumerable.Range(0, sequenceLength).Select(i => sequence[startIndex + i]).ToList();
                var repeatedSequence = Encode<T>(oneSequence, comparer).ToArray();
                return new SequenceRepeat<T>(count, repeatedSequence);
            }
        }

        sequenceLength++;
    }

    // fall back, we could not find anything that repeated at all
    return new SingleSymbol<T>(sequence[startIndex]);
}

public abstract class Repeat<T>
{
    public int Count { get; private set; }

    protected Repeat(int count)
    {
        Count = count;
    }

    public abstract int Length
    {
        get;
    }
}

public class SingleSymbol<T> : Repeat<T>
{
    public T Value { get; private set; }

    public SingleSymbol(T value)
        : base(1)
    {
        Value = value;
    }

    public override string ToString()
    {
        return string.Format("{0}", Value);
    }

    public override int Length
    {
        get
        {
            return Count;
        }
    }
}

public class SequenceRepeat<T> : Repeat<T>
{
    public Repeat<T>[] Values { get; private set; }

    public SequenceRepeat(int count, Repeat<T>[] values)
        : base(count)
    {
        Values = values;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, string.Join("", Values.Select(v => v.ToString())));
    }

    public override int Length
    {
        get
        {
            int oneLength = 0;
            foreach (var value in Values)
                oneLength += value.Length;
            return Count * oneLength;
        }
    }
}

public class GroupRepeat<T> : Repeat<T>
{
    public Repeat<T> Group { get; private set; }

    public GroupRepeat(int count, Repeat<T> group)
        : base(count)
    {
        Group = group;
    }

    public override string ToString()
    {
        return string.Format("({0}{1})", Count, Group);
    }

    public override int Length
    {
        get
        {
            return Count * Group.Length;
        }
    }
}

这篇关于无损分级游程编码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆