Lightweight (de)compression algorithm for embedded use

I have a low-resource embedded system with a graphical user interface. The interface requires font data. To conserve read-only memory (flash), the font data needs to be compressed. I am looking for an algorithm for this purpose.

Properties of the data to be compressed

  • transparency data for a rectangular pixel map with 8 bits per pixel
  • there are typically around 200..300 glyphs in a font (a typeface sampled at a certain size)
  • each glyph is typically from 6x9 to 15x20 pixels in size
  • there are a lot of zeros ("no ink") and somewhat fewer 255s ("completely inked"); otherwise the distribution of octets is quite even due to the nature of anti-aliasing

Requirements for the compression algorithm

  • The important metric for the decompression algorithm is the size of the data plus the size of the algorithm (as they will reside in the same limited memory).
  • There is very little RAM available for the decompression; it is possible to decompress the data for a single glyph into RAM but not much more.
  • To make things more difficult, the algorithm has to be very fast on a 32-bit microcontroller (ARM Cortex-M core), as the glyphs need to be decompressed while they are being drawn onto the display. Ten or twenty machine cycles per octet is ok, a hundred is certainly too much.
  • To make things easier, the complete corpus of data is known a priori, and there is a lot of processing power and memory available during the compression phase.

Conclusions and thoughts

  • The naïve approach of just packing each octet by some variable-length encoding does not give good results due to the relatively high entropy.
  • Any algorithm taking advantage of data decompressed earlier seems to be out of the question, as it is not possible to store the decompressed data of other glyphs. This makes LZ algorithms less efficient, as they can only reference a small amount of data.
  • Constraints on the processing power seem to rule out most bitwise operations, i.e. decompression should handle the data octet-by-octet. This makes Huffman coding difficult and arithmetic coding impossible.
  • The problem seems to be a good candidate for static dictionary coding, as all data is known beforehand, and the data is somewhat repetitive in nature (different glyphs share the same shapes).

Questions

  • How can a good dictionary be constructed? I know that finding the optimal dictionary for certain data is an NP-complete problem, but are there any reasonably good approximations? I have tried zstandard's dictionary builder, but the results were not very good.
  • Is there something in my conclusions that I've gotten wrong? (Am I on the wrong track and omitting something obvious?)

Best algorithm this far

Just to give some background information, the most useful algorithm I have been able to figure out so far is as follows:

  • All samples in the font data for a single glyph are concatenated (flattened) into a one-dimensional array (vector, table).
  • Each sample has three possible states: 0, 255, and "something else".
  • This information is packed five consecutive samples at a time into a 5-digit base-three number (0..3^5-1 = 242).
  • As there are some extra values available in an octet (2^8 = 256, 3^5 = 243), they are used to signify longer strings of 0's and 255's.
  • For each "something else" value the actual value (1..254) is stored in a separate vector.

This data is fast to decompress, as the base-3 values can be decoded into base-4 values by a smallish (243 x 3 = 729 octets) lookup table. The compression ratios are highly dependent on the font size, but with my typical data I can get around 1:2. As this is significantly worse than LZ variants (which get around 1:3), I would like to try the static dictionary approach.
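
To make the packing step concrete, here is a minimal sketch (my own illustration, not code from the question; the run-length codes 243..255 and the separate gray-value vector are left out):

#include <stdint.h>

/* Classify one 8-bit sample into a trit: 0 = "no ink", 1 = "full ink",
   2 = "something else" (whose actual value goes into a separate vector). */
static unsigned classify(uint8_t sample)
{
    if (sample == 0)
        return 0;
    if (sample == 255)
        return 1;
    return 2;
}

/* Pack five consecutive samples into one base-3 coded octet (0..242).
   The remaining octet values 243..255 would be used as run-length codes
   for long strings of 0 or 255; that extension is omitted here. */
static uint8_t pack5(const uint8_t samples[5])
{
    uint8_t code = 0;

    for (int i = 4; i >= 0; i--)
        code = (uint8_t)(code * 3 + classify(samples[i]));

    return code;
}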

Of course, the usual LZ variants use Huffman or arithmetic coding, which naturally makes the compressed data smaller. On the other hand, I have all the data available, and the compression speed is not an issue. This should make it possible to find much better dictionaries.

Due to the nature of the data I might be able to use a lossy algorithm, but in that case the most likely lossy algorithm would be reducing the number of quantization levels in the pixel data. That won't change the underlying compression problem much, and I would like to avoid the resulting bit-alignment hassle.

Solution

I admit this is a borderline case of being a good answer to my own question, but as I have researched the problem somewhat, this answer both describes the approach I chose and gives some more information on the nature of the problem, should someone else bump into it.

"The right answer" a.k.a. final algorithm

What I ended up with is a variant of what I describe in the question. First, each glyph is split into trits 0, 1, and intermediate. This ternary information is then compressed with a 256-slot static dictionary. Each item in the dictionary (or look-up table) is a binary encoded string (0=0, 10=1, 11=intermediate) with a single 1 added to the most significant end.
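
As an illustration of that encoding (my own sketch, not code from the original answer), one dictionary word could be built from a trit sequence like this, appending code bits from the LSB upwards so that the decoder shown further below can consume them with simple right shifts:

#include <stdint.h>

/* Trit values used here: 0 = zero pixel, 1 = full (255) pixel, 2 = intermediate. */

/* Build one dictionary word from a trit sequence.  The per-trit codes
   (0 -> "0", 1 -> "10", 2 -> "11", first code bit in the lowest position)
   are appended from the LSB upwards, and a single 1 is placed above them
   as the end sentinel.  The total code length must fit in 31 bits. */
static uint32_t encode_dictionary_item(const uint8_t *trits, unsigned count)
{
    uint32_t word = 0;
    unsigned pos = 0;

    for (unsigned i = 0; i < count; i++)
    {
        if (trits[i] == 0)
        {
            pos += 1;                      /* code "0": a single zero bit */
        }
        else if (trits[i] == 1)
        {
            word |= 1u << pos;             /* code "10": 1 then 0 */
            pos += 2;
        }
        else
        {
            word |= 3u << pos;             /* code "11": 1 then 1 */
            pos += 2;
        }
    }

    word |= 1u << pos;                     /* end sentinel */
    return word;
}

/* Example: the trits {0, 1, 2} encode to binary 111010, i.e. 0x3A. */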

The grayscale data (for the intermediate trits) is interspersed between the references to the look-up table. So, the data essentially looks like this:

<LUT reference><gray value><gray value><LUT reference>...

The number of gray scale values naturally depends on the number of intermediate trits in the ternary data looked up from the static dictionary.

Decompression code is very short and can easily be written as a state machine with only one pointer and one 32-bit variable giving the state. Something like this:

static uint32_t trits_to_decode;
static uint8_t *next_octet;

/* This should be called when starting to decode a glyph
   data : pointer to the compressed glyph data */
void start_glyph(uint8_t *data)
{
    next_octet = data;        // set the pointer to the beginning of the glyph
    trits_to_decode = 1;      // this triggers reloading a new dictionary item
}


/* This function returns the next 8-bit pixel value */
uint8_t next_pixel(void)
{
    // end sentinel only? if so, we are out of ternary data
    if (trits_to_decode == 1)
        // get the next ternary dictionary item
        trits_to_decode = dictionary[*next_octet++];

    // get the next pixel from the ternary word
    // check the LSB bit(s)
    if (trits_to_decode & 1)
    {
        trits_to_decode >>= 1;
        // either full value or gray value, check the next bit
        if (trits_to_decode & 1)
        {
            trits_to_decode >>= 1;
            // grayscale value; get next from the buffer
            return *next_octet++; 
        }
        // if we are here, it is a full value
        trits_to_decode >>= 1;
        return 255;
    }

    // we have a zero, return it
    trits_to_decode >>= 1;
    return 0;
}

(The code has not been tested in exactly this form, so there may be typos or other stupid little errors.)

There is a lot of repetition with the shift operations. I am not too worried, as the compiler should be able to clean it up. (Actually, left shift could be even better, because then the carry bit could be used after shifting. But as there is no direct way to do that in C, I don't bother.)
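
For reference, the calling side is trivial; a glyph-drawing loop might look roughly like this (a sketch building on start_glyph()/next_pixel() above, with a hypothetical draw_pixel() and hypothetical glyph metrics):

/* Hypothetical output routine; anything that consumes an 8-bit alpha works. */
extern void draw_pixel(int x, int y, uint8_t alpha);

void draw_glyph(uint8_t *compressed_data, int x0, int y0, int width, int height)
{
    start_glyph(compressed_data);          /* reset the decoder state */

    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            draw_pixel(x0 + x, y0 + y, next_pixel());
}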

One more optimization relates to the size of the dictionary (look-up table). There may be short and long items, and hence it can be built to support 32-bit, 16-bit, or 8-bit items. In that case the dictionary has to be ordered so that small numerical values refer to 32-bit items, middle values to 16-bit items and large values to 8-bit items to avoid alignment problems. Then the look-up code looks like this:

/* The three tables hold the 32-, 16- and 8-bit dictionary items; the result
   is widened to 32 bits, as that is what the decoder state variable needs. */
static uint32_t dictionary_lookup(uint8_t octet)
{
    if (octet < NUMBER_OF_32_BIT_ITEMS)
        return dictionary32[octet];
    if (octet < NUMBER_OF_32_BIT_ITEMS + NUMBER_OF_16_BIT_ITEMS)
        return dictionary16[octet - NUMBER_OF_32_BIT_ITEMS];
    return dictionary8[octet - NUMBER_OF_16_BIT_ITEMS - NUMBER_OF_32_BIT_ITEMS];
}

Of course, if every font has its own dictionary, the constants will become variables looked up from the font information. Any half-decent compiler will inline that function, as it is only called from one place.

If the number of quantization levels is reduced, that can be handled as well. The easiest case is with 4-bit gray levels (1..14). This requires one 8-bit state variable to hold the gray levels. Then the gray level branch will become:

// new state value
static uint8_t gray_value;
...

    // new variable within the next_pixel() function
    uint8_t return_value;

    ...

            // there is no old gray value available?
            if (gray_value == 0)
                gray_value = *next_octet++;
            // extract the low nibble
            return_value = gray_value & 0x0f;
            // shift the high nibble into low nibble
            gray_value >>= 4;
            return return_value;

This actually allows using 15 intermediate gray levels (a total of 17 levels), which maps very nicely into a linear 255-value system.
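
The answer does not spell out the exact expansion back to 8 bits, but one plausible linear mapping of the 17 levels (my assumption, not necessarily the author's) would be:

/* Hypothetical expansion of a quantized level (0 = no ink, 1..15 = the
   intermediate nibbles, 16 = full ink) back to the 0..255 range;
   maps 0..16 approximately evenly onto 0..255. */
static uint8_t expand_level(unsigned level)
{
    return (uint8_t)((level * 255u + 8u) / 16u);
}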

Three- or five-bit data is easier to pack into a 16-bit halfword, with the MSB always set to one. Then the same trick as with the ternary data can be used (shift until you get 1).
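
A sketch of how that could look for 5-bit data (again my own illustration; it reuses the next_octet pointer from the decoder above and assumes the halfwords are stored low byte first):

static uint16_t gray_bits = 1;             /* 1 means "empty, reload next halfword" */

/* Pack up to three 5-bit gray values into one halfword, MSB as the sentinel. */
static uint16_t pack_gray5(uint8_t v0, uint8_t v1, uint8_t v2)
{
    return (uint16_t)(0x8000u | ((uint16_t)v2 << 10) | ((uint16_t)v1 << 5) | v0);
}

/* Return the next 5-bit gray value, shifting until only the sentinel is left. */
static uint8_t next_gray5(void)
{
    uint8_t value;

    if (gray_bits == 1)                    /* only the sentinel left? reload */
    {
        gray_bits  = *next_octet++;
        gray_bits |= (uint16_t)(*next_octet++) << 8;
    }

    value = gray_bits & 0x1f;              /* low five bits */
    gray_bits >>= 5;
    return value;
}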

It should be noted that the compression ratio starts to deteriorate at some point. The amount of compression with the ternary data does not depend on the number of gray levels. The gray level data is uncompressed, and the number of octets scales (almost) linearly with the number of bits. For a typical font the gray level data at 8 bits is 1/2 .. 2/3 of the total, but this is highly dependent on the typeface and size.

So, reduction from 8 to 4 bits (which is visually quite imperceptible in most cases) reduces the compressed size typically by 1/4..1/3, whereas the further reduction offered by going down to three bits is significantly less. Two-bit data does not make sense with this compression algorithm.
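
To put rough numbers on that, using the example font from the next section: the 46697 intermediate values cost 46697 octets at 8 bits but only about 23349 octets at 4 bits, while the dictionary and the compressed ternary data (647 + 18492 octets) stay the same, so the total drops from 65836 to roughly 42500 octets, in line with the ~42920 figure reported below, a reduction of about a third.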

How to build the dictionary?

If the decompression algorithm is very straightforward and fast, the real challenges are in the dictionary building. It is easy to prove that there is such a thing as an optimal dictionary (a dictionary giving the least number of compressed octets for a given font), but wiser people than me seem to have proven that the problem of finding such a dictionary is NP-complete.

With my arguably rather lacking theoretical knowledge in the field, I thought there would be great tools offering reasonably good approximations. There might be such tools, but I could not find any, so I rolled my own mickey-mouse version. EDIT: the earlier algorithm was rather goofy; a simpler and more effective one was found:

  1. start with a static dictionary of '0', 'g', '1' (where 'g' signifies an intermediate value)
  2. split the ternary data for each glyph into a list of trits
  3. find the most common consecutive combination of items (it will most probably be '0', '0' at the first iteration)
  4. replace all occurrences of that combination with a new dictionary item and add the item to the dictionary (e.g., the data '0', '1', '0', '0', 'g' will become '0', '1', '00', 'g' if '0', '0' is replaced by '00')
  5. remove any unused items in the dictionary (they may occur at least in theory)
  6. repeat steps 3-5 until the dictionary is full (i.e. at least 253 rounds)

This is still a very simplistic approach and it probably gives a very sub-optimal result. Its only merit is that it works.
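
A rough host-side sketch of these steps (my own illustration, essentially a byte-pair-encoding style merge; the pruning of unused items in step 5 is left out, and symbol IDs 0..2 stand for the three base trits) could look like this:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 256

/* One glyph = a sequence of symbol IDs; IDs 0, 1, 2 are the base trits
   ('0', '1', 'g'), IDs 3..255 are the merged dictionary items. */
typedef struct
{
    uint16_t *sym;
    size_t len;
} glyph_seq;

/* Count all adjacent symbol pairs over all glyphs and return the count of
   the most common one (its symbols are written to *best_a, *best_b). */
static size_t most_common_pair(const glyph_seq *g, size_t nglyphs,
                               unsigned *best_a, unsigned *best_b)
{
    static size_t counts[MAX_SYMBOLS][MAX_SYMBOLS];
    size_t best = 0;

    memset(counts, 0, sizeof counts);

    for (size_t i = 0; i < nglyphs; i++)
        for (size_t j = 0; j + 1 < g[i].len; j++)
            counts[g[i].sym[j]][g[i].sym[j + 1]]++;

    for (unsigned a = 0; a < MAX_SYMBOLS; a++)
        for (unsigned b = 0; b < MAX_SYMBOLS; b++)
            if (counts[a][b] > best)
            {
                best = counts[a][b];
                *best_a = a;
                *best_b = b;
            }

    return best;
}

/* Replace every occurrence of the pair (a, b) with the new symbol id. */
static void merge_pair(glyph_seq *g, size_t nglyphs,
                       unsigned a, unsigned b, unsigned id)
{
    for (size_t i = 0; i < nglyphs; i++)
    {
        size_t out = 0;

        for (size_t j = 0; j < g[i].len; j++)
        {
            if (j + 1 < g[i].len && g[i].sym[j] == a && g[i].sym[j + 1] == b)
            {
                g[i].sym[out++] = (uint16_t)id;
                j++;                       /* skip the second half of the pair */
            }
            else
            {
                g[i].sym[out++] = g[i].sym[j];
            }
        }
        g[i].len = out;
    }
}

/* Grow the dictionary from the three base trits to 256 entries (steps 3-6). */
void build_dictionary(glyph_seq *glyphs, size_t nglyphs)
{
    unsigned next_id = 3;

    while (next_id < MAX_SYMBOLS)
    {
        unsigned a = 0, b = 0;

        if (most_common_pair(glyphs, nglyphs, &a, &b) == 0)
            break;                         /* nothing left to merge */

        merge_pair(glyphs, nglyphs, a, b, next_id++);
    }

    /* Each symbol's trit expansion would then be bit-packed (0=0, 10=1,
       11=intermediate, plus the sentinel) into the run-time look-up table. */
}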

How well does it work?

One answer is: well enough. To elaborate on that a bit, here are some numbers. This is a font with 864 glyphs, a typical glyph size of 14x11 pixels, and 8 bits per pixel.

  • raw uncompressed size: 127101 octets
  • number of intermediate values: 46697
  • Shannon entropies (octet-by-octet):
    • total: 528914 bits = 66115 octets
    • ternary data: 176405 bits = 22051 octets
    • intermediate values: 352509 bits = 44064 octets
  • simply compressed ternary data (0=0, 10=1, 11=intermediate) (127101 trits): 207505 bits = 25939 octets
  • dictionary compressed ternary data: 18492 octets
    • entropy: 136778 bits = 17097 octets
  • dictionary size: 647 octets
  • full compressed data: 647 + 18492 + 46697 = 65836 octets
  • compression: 48.2 %

The comparison with octet-by-octet entropy is quite revealing. The intermediate value data has high entropy, whereas the ternary data can be compressed. This can also be explained by the high number of 0 and 255 values in the raw data (as compared to any intermediate values).

We do not do anything to compress the intermediate values, as there do not seem to be any meaningful patterns. However, we beat the entropy by a clear margin with the ternary data, and even the total amount of data is below the entropy limit. So, we could do worse.

Reducing the number of quantization levels to 17 would reduce the data size to approximately 42920 octets (compression over 66 %). The entropy is then 41717 octets, so the algorithm ends up slightly worse than the entropy limit this time, as expected.

In practice, smaller font sizes are difficult to compress. This should be no surprise, as a larger fraction of the information is in the grayscale data. Very big font sizes compress efficiently with this algorithm, but there run-length compression is a much better candidate.

What would be better?

If I knew, I would use it! But I can still speculate.

Jubatian suggests there would be a lot of repetition in a font. This must be true with the diacritics, as aàäáâå have a lot in common in almost all fonts. However, it does not seem to be true with letters such as p and b in most fonts. While the basic shape is close, it is not enough. (Careful pixel-by-pixel typeface design is then another story.)

Unfortunately, this inevitable repetition is not very easy to exploit in smaller font sizes. I tried creating a dictionary of all possible scan lines and then only referencing those. Unfortunately, the number of different scan lines is high, so the overhead added by the references outweighs the benefits. The situation changes somewhat if the scan lines themselves can be compressed, but there the small number of octets per scan line makes efficient compression difficult. This problem is, of course, dependent on the font size.

My intuition tells me that this would still be the right way to go, if both longer and shorter runs than full scan lines are used. This, combined with using 4-bit pixels, would probably give very good results, if only there were a way to create that optimal dictionary.

One hint in this direction is that an LZMA2-compressed file (xz at its highest compression setting) of the complete font data (127101 octets) is only 36720 octets. Of course, this format fulfils none of the other requirements (fast to decompress, can be decompressed glyph-by-glyph, low RAM requirements), but it still shows that there is more redundancy in the data than my cheap algorithm has been able to exploit.

Dictionary coding is typically combined with Huffman or arithmetic coding after the dictionary step. We cannot do it here, but if we could, it would save another 4000 octets.
