Sorting a file to optimize for compression efficiency


Problem Description

We have some large data files that are being concatenated, compressed, and then sent to another server. The compression reduces the transmission time to the destination server, so the smaller we can get the file in a short period of time, the better. This is a highly time-sensitive process.

The data files contain many rows of tab-delimited text, and the order of the rows does not matter.

We noticed that when we sorted the file by the first field, the compressed file size was much smaller, presumably because duplicates of that column end up next to each other. However, sorting a large file is slow, and there's no real reason it needs to be sorted other than that this happens to improve compression. There's also no relationship between what's in the first column and what's in subsequent columns. There could be some ordering of rows that compresses even smaller, or alternatively there could be an algorithm that similarly improves compression performance but requires less time to run.
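The effect is easy to reproduce; here is a small synthetic experiment (the field layout and row count are made up) comparing gzip output with and without a sort on the first field:

```python
import gzip
import random

# Synthetic tab-delimited rows whose first field repeats heavily.
rows = [
    f"user{random.randrange(100)}\t{random.random()}\t{random.random()}"
    for _ in range(50_000)
]

unsorted_blob = "\n".join(rows).encode()
sorted_blob = "\n".join(
    sorted(rows, key=lambda r: r.split("\t", 1)[0])
).encode()

print("unsorted:", len(gzip.compress(unsorted_blob)), "bytes")
print("sorted:  ", len(gzip.compress(sorted_blob)), "bytes")
```

On real data with many distinct first-field values, the gap from sorting tends to be larger, because matching prefixes move closer together and back-references get cheaper to encode.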

What approach could I use to reorder rows to optimize the similarity between neighboring rows and improve compression performance?

Recommended Answer

Here are some suggestions:


  1. Split the file into smaller batches and sort those. Sorting several small data sets is faster than sorting one big blob, and the work is easy to parallelize this way (see the first sketch after this list).

  2. Experiment with different compression algorithms. Different algorithms have different throughputs and ratios; you are interested in the algorithms on the sparse boundary of those two dimensions, i.e., the ones where no alternative is both faster and smaller (see the second sketch after this list).

  3. Use a larger dictionary size. This lets the compressor reference data further back in the stream (also covered in the second sketch).
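A minimal sketch of suggestion 1 in Python, assuming the whole input fits in memory; the function names (`sort_chunk`, `compress_in_sorted_chunks`) and the chunk size are hypothetical choices, not part of the original answer:

```python
import gzip
from concurrent.futures import ProcessPoolExecutor


def sort_chunk(lines):
    # Sorting each chunk independently is much cheaper than one global
    # sort, yet still puts duplicate first fields next to each other
    # within the chunk, which is what helps the compressor.
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


def compress_in_sorted_chunks(in_path, out_path, chunk_size=100_000):
    with open(in_path) as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size]
              for i in range(0, len(lines), chunk_size)]
    # Sort the chunks in parallel across processes.
    with ProcessPoolExecutor() as pool:
        sorted_chunks = list(pool.map(sort_chunk, chunks))
    with gzip.open(out_path, "wt") as out:
        for chunk in sorted_chunks:
            out.writelines(chunk)
```

Because row order doesn't matter, concatenating independently sorted chunks is safe, and each chunk captures most of the locality benefit of a full sort at a fraction of the cost.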
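For suggestions 2 and 3, a rough benchmark over the compressors in Python's standard library can map out the throughput/ratio trade-off. The input file name is a placeholder, and lzma's preset stands in here for an explicit dictionary-size knob, since higher presets use (among other things) larger dictionaries:

```python
import bz2
import gzip
import lzma
import time

data = open("sample.tsv", "rb").read()  # placeholder input file

# Candidate compressors at a few settings. Higher lzma presets use
# (among other things) larger dictionaries, letting the encoder
# reference data further back in the stream.
candidates = {
    "gzip-1": lambda d: gzip.compress(d, compresslevel=1),
    "gzip-9": lambda d: gzip.compress(d, compresslevel=9),
    "bz2-9": lambda d: bz2.compress(d, compresslevel=9),
    "lzma-0": lambda d: lzma.compress(d, preset=0),
    "lzma-6": lambda d: lzma.compress(d, preset=6),
}

for name, compress in candidates.items():
    start = time.perf_counter()
    size = len(compress(data))
    elapsed = time.perf_counter() - start
    print(f"{name}: {size} bytes in {elapsed:.2f} s")
```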

Note that sorting is important no matter which algorithm and dictionary size you choose, because references to old data tend to use more bits. Also, sorting by a time dimension tends to group together rows that come from a similar data distribution. For example, Stack Overflow has more bot traffic at night than during the day, so the UserAgent field value distribution in its HTTP logs probably varies greatly with the time of day.
