How is data compression more effective than indexing for search performance?

Question

For our application, we keep large amounts of data indexed by three integer columns (source, type and time). Loading significant chunks of that data can take some time, and we have implemented various measures to reduce the amount of data that has to be searched and loaded for larger queries, such as storing larger granularities for queries that don't require a high resolution (time-wise).

When searching for data in our backup archives, where the data is stored in bzipped text files but has basically the same structure, I noticed that it is significantly faster to untar to stdout and pipe it through grep than to untar it to disk and grep the files. In fact, the untar-to-pipe was even noticeably faster than just grepping the already-uncompressed files (i.e. discounting the untar-to-disk step).
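
A minimal sketch of the streaming approach in Python, rather than the original shell pipeline (roughly: tar -xjOf backup.tar.bz2 | grep pattern); the archive name and the line format are assumptions, not details from the question. The point is that only the small compressed stream is read from disk, and the uncompressed bytes never touch it:

import tarfile

def stream_grep(archive_path, needle):
    """Scan an archive for matching lines without extracting to disk."""
    hits = []
    with tarfile.open(archive_path, "r:bz2") as tar:  # decompresses as a stream
        for member in tar:
            if not member.isfile():
                continue
            for raw_line in tar.extractfile(member):  # lazy, buffered reads
                line = raw_line.decode("utf-8", errors="replace")
                if needle in line:
                    hits.append(line.rstrip("\n"))
    return hits

print(stream_grep("backup.tar.bz2", "source=42"))  # hypothetical archive and pattern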

This made me wonder if the performance impact of disk I/O is actually much heavier than I thought. So here's my question:

Do you think that putting the data for multiple rows into a single-row (compressed) blob field, and searching for individual rows on the fly during extraction, could be faster than searching for the same rows through a table index?

For example, instead of having this table

CREATE TABLE data ( `source` INT, `type` INT, `timestamp` INT, `value` DOUBLE);

I would have

CREATE TABLE quickdata ( `source` INT, `type` INT, `day` INT, `dayvalues` BLOB );

with approximately 100-300 rows in data for each row in quickdata and searching for the desired timestamps on the fly during decompression and decoding of the blob field.
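
A minimal sketch of the proposed lookup, assuming a struct-packed, zlib-compressed encoding for the blob (the question does not specify one); the names pack_day and find_in_day are hypothetical. Each quickdata blob is decompressed and decoded sequentially, and the wanted timestamp is filtered out on the fly:

import struct
import zlib

RECORD = struct.Struct("<id")  # one record: int32 timestamp + float64 value

def pack_day(samples):
    """Encode one day's (timestamp, value) rows into a single compressed BLOB."""
    raw = b"".join(RECORD.pack(ts, val) for ts, val in samples)
    return zlib.compress(raw)

def find_in_day(blob, wanted_ts):
    """Decompress and decode the BLOB, filtering for a timestamp on the fly."""
    for ts, val in RECORD.iter_unpack(zlib.decompress(blob)):
        if ts == wanted_ts:
            return val
    return None

blob = pack_day([(3600 * h, h * 0.5) for h in range(24)])  # 24 fake samples
print(find_in_day(blob, 3600 * 6))  # -> 3.0

A sequential scan of 100-300 decoded records in memory is cheap compared to the disk seeks an index lookup may incur, which is the trade-off the question is betting on.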

Does this make sense to you? What parameters should I investigate? What strings might be attached? What DB features (any DBMS) exist to achieve similar effects?

Answer


This made me wonder if the performance impact of disk I/O is actually much heavier than I thought.

Definitely. If you have to go to disk, the performance hit is many orders of magnitude greater than for memory. This reminds me of the classic Jim Gray paper, Distributed Computing Economics:


Computing economics are changing. Today there is rough price parity between (1) one database access, (2) ten bytes of network traffic, (3) 100,000 instructions, (4) 10 bytes of disk storage, and (5) a megabyte of disk bandwidth. This has implications for how one structures Internet-scale distributed computing: one puts computing as close to the data as possible in order to avoid expensive network traffic.
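
To put rough numbers on "many orders of magnitude", a back-of-envelope using commonly cited ballpark latencies (assumptions, not measurements from this system):

dram_access_s = 100e-9  # roughly 100 ns per DRAM access (ballpark, assumed)
disk_seek_s = 10e-3     # roughly 10 ms per random seek on a spinning disk (ballpark, assumed)
print(f"one disk seek ~ {disk_seek_s / dram_access_s:,.0f} DRAM accesses")  # ~100,000

Around five orders of magnitude per random hit is why a sequential read of a compressed stream can beat an access pattern that seeks all over the disk.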

The question, then, is how much data do you have and how much memory can you afford?

If the database is really huge, as in nobody could ever afford that much memory, even in 20 years, you need clever distributed database systems like Google's BigTable or Hadoop.
