Optimizing byte-pair encoding


Question

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.

The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
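A trivial, literal implementation of the kind described above might look like the sketch below. This is not the benchmark code; the function name and layout are illustrative, and it only performs the replacement, omitting the pair table a real compressor would emit for the decoder.

```c
#include <string.h>

/* One naive BPE pass over buf[0..len): count every adjacent byte pair,
 * then replace the most frequent pair in place with the unused byte
 * 'sub'. Returns the new length. Illustrative sketch only -- a real
 * compressor also records the (pair, sub) mapping so the decoder can
 * expand 'sub' back into the original two bytes. */
size_t bpe_pass(unsigned char *buf, size_t len, unsigned char sub)
{
    static unsigned count[256][256];
    memset(count, 0, sizeof count);
    for (size_t i = 0; i + 1 < len; i++)
        count[buf[i]][buf[i + 1]]++;

    /* find the most frequent pair */
    unsigned best = 0;
    unsigned char a = 0, b = 0;
    for (int x = 0; x < 256; x++)
        for (int y = 0; y < 256; y++)
            if (count[x][y] > best) {
                best = count[x][y];
                a = (unsigned char)x;
                b = (unsigned char)y;
            }
    if (best < 2)
        return len; /* no pair worth replacing */

    /* rewrite the buffer, substituting 'sub' for each occurrence */
    size_t out = 0;
    for (size_t i = 0; i < len; ) {
        if (i + 1 < len && buf[i] == a && buf[i + 1] == b) {
            buf[out++] = sub;
            i += 2;
        } else {
            buf[out++] = buf[i++];
        }
    }
    return out;
}
```

Running such a pass repeatedly, each time with a fresh unused byte value, is what makes the trivial version slow: every pass rescans and recounts the whole buffer.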

The runtime of my trivial implementation was less than stellar, however.

How can it be optimized?

Answer

This is a summary of my progress so far:

Googling found this little report that links to the original code and cites the source:


Philip Gage, 'A New Algorithm for Data Compression', which appeared in The C Users Journal, February 1994 edition.

That code uses a hashtable to track the digraphs.

My test data is enwik8 from the Hutter Prize.

| Implementation | Time (min.sec) | Notes                                           |
|----------------|----------------|-------------------------------------------------|
| bpev2          | 1.24           | The current version in the large text benchmark |
| bpe_c          | 1.07           | The original version by Gage, using a hashtable |
| bpev3          | 0.25           | Uses a list, custom sort, less memcpy           |

bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, which is the smallest count at which we can gain a byte by compressing); this list is sorted, and the first substitution is made.
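Building that list might be sketched as below. This is an illustration of the list-based approach, not the benchmark code; the `Digraph` type, the function name, and the assumption that a replaced pair needs to occur at least 4 times to net a saving (each occurrence saves one byte, against a few bytes of table overhead per pair) are mine.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char a, b;   /* the digraph's two bytes */
    unsigned count;       /* occurrences in the block */
} Digraph;

/* qsort comparator: most frequent digraph first */
static int by_count_desc(const void *p, const void *q)
{
    unsigned cp = ((const Digraph *)p)->count;
    unsigned cq = ((const Digraph *)q)->count;
    return (cp < cq) - (cp > cq);
}

/* Count every digraph in a 10KB block, keep only those at or above the
 * threshold, and return them sorted by frequency, ready for the first
 * substitution. Illustrative sketch of the list-based approach. */
size_t build_digraph_list(const unsigned char *block, size_t len,
                          Digraph *out, unsigned threshold)
{
    static unsigned count[256][256];
    memset(count, 0, sizeof count);
    for (size_t i = 0; i + 1 < len; i++)
        count[block[i]][block[i + 1]]++;

    size_t n = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (count[a][b] >= threshold) {
                out[n].a = (unsigned char)a;
                out[n].b = (unsigned char)b;
                out[n].count = count[a][b];
                n++;
            }
    qsort(out, n, sizeof *out, by_count_desc);
    return n;
}
```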

As the substitutions are made, the statistics are updated; typically only around 10 or 20 digraphs change each pass; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than re-sorting the whole digraph list each pass, since the list is nearly sorted.
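The merge step above can be sketched as a single linear pass over two descending-sorted sequences. For brevity this sketch merges bare counts rather than full digraph records; the function name is mine, not from the benchmark code.

```c
#include <stddef.h>

/* Merge a small, freshly sorted batch of updated digraph counts into
 * the main list (both sorted descending). One O(n + m) merge per pass
 * beats re-sorting the whole ~200-entry list when only 10-20 entries
 * have changed. */
size_t merge_desc(const unsigned *list, size_t n,
                  const unsigned *batch, size_t m, unsigned *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < n && j < m)
        out[k++] = (list[i] >= batch[j]) ? list[i++] : batch[j++];
    while (i < n)
        out[k++] = list[i++];
    while (j < m)
        out[k++] = batch[j++];
    return k;
}
```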

The original code copied data between 'tmp' and 'buf' byte buffers; bpev3 just swaps the buffer pointers, which alone is worth about 10 seconds of runtime.
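The pointer swap is the usual double-buffering idiom; a minimal sketch (the function name is illustrative):

```c
/* After each pass, exchange the roles of the source and destination
 * buffers by swapping the pointers, instead of memcpy'ing the
 * transformed block back into the source buffer. */
void swap_buffers(unsigned char **buf, unsigned char **tmp)
{
    unsigned char *t = *buf;
    *buf = *tmp;
    *tmp = t;
}
```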

Applying the buffer-swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version; I think the hashtable is of arguable value, and that a list is a better structure for this problem.

It's still multi-pass though, and so it's not a generally competitive algorithm.

If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of its larger block sizes, it performs better than my bpe on enwik9. Also, the performance gap between the hashtable and my lists is much closer there - I put that down to the march=PentiumPro that the LTCB build uses.

There are of course occasions where it is suitable and used; Symbian uses it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.

