Optimizing byte-pair encoding


Question


Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.


The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.


The runtime of my trivial implementation was less than stellar, however.


How can this be optimized? Is it possible to do it in a single pass?

Answer


This is a summary of my progress so far:


Googling found this little report that links to the original code and cites the source:


Philip Gage, 'A New Algorithm for Data Compression', which appeared in The C Users Journal, February 1994 edition.


The links to the code on the Dr. Dobb's site are broken, but that webpage mirrors them.


That code uses a hash table to track the used digraphs and their counts on each pass over the buffer, so as to avoid recomputing them from scratch every pass.

My test data is enwik8 from the Hutter Prize (http://prize.hutter1.net/).

|----------------|-----------------|-------------------------------------------------|
| Implementation | Time (min.secs) | Notes                                           |
|----------------|-----------------|-------------------------------------------------|
| bpev2          | 1.24            | the current version in the large text benchmark |
| bpe_c          | 1.07            | the original version by Gage, using a hashtable |
| bpev3          | 0.25            | uses a list, custom sort, less memcpy           |
|----------------|-----------------|-------------------------------------------------|


bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, which is the smallest count at which we gain a byte by compressing); this list is sorted and the first substitution is made.


As the substitutions are made, the statistics are updated; typically on each pass only around 10 or 20 digraphs change; these are 'painted' and sorted, then merged with the digraph list; this is substantially faster than always re-sorting the whole digraph list each pass, since the list is nearly sorted.


The original code moved data between 'tmp' and 'buf' byte buffers; bpev3 just swaps the buffer pointers, which alone is worth about 10 seconds of runtime.


Applying the buffer-swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version; I think the hashtable is of arguable value, and that a list is a better structure for this problem.


It's still multi-pass, though, and so it's not a generally competitive algorithm.


If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of its larger block sizes, it performs better than my bpe on enwik9. Also, the performance gap between the hash tables and my lists is much closer - I put that down to the march=PentiumPro setting that the LTCB uses.

There are of course occasions where it is suitable and used; Symbian uses it for compressing pages in ROM images (see http://developer.symbian.org/xref/oss/xref/MCL/sf/os/buildtools/toolsandutils/e32tools/compress/byte_pair.h). I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.

