Chained Hash Table and understanding Deflate

Problem Description

I am currently trying to create a custom Deflate implementation in C#.

I am currently trying to implement the "pattern search" part, where I have (up to) 32 KiB of data and am trying to find the longest possible match for my input.

RFC 1951, which defines Deflate, says the following about that process:

The compressor uses a chained hash table to find duplicated strings, using a hash function that operates on 3-byte sequences. At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match.

I do know what a hash function is, and I know what a hash table is as well. But what is a "chained hash table", and how could such a structure be designed to be efficient (in C#) when handling a large amount of data? Unfortunately I didn't understand how the structure described in the RFC works.

What kind of hash function could I choose (what would make sense)?

Thanks in advance!

Recommended Answer

A chained hash table is a hash table that stores every item you put in it, even if the keys of two items hash to the same value, or even if two items have exactly the same key.
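
A minimal C# sketch of that idea (the class and its member names are illustrative, not taken from any particular library): each bucket holds a list of every entry that hashed there, so colliding or identical keys never overwrite one another.

    using System.Collections.Generic;

    // Minimal chained hash table sketch: every (key, value) pair is kept,
    // even when keys collide or repeat.
    class ChainedHashTable<TKey, TValue>
    {
        // Each bucket is a chain (list) of all entries that hashed to it.
        private readonly List<KeyValuePair<TKey, TValue>>[] buckets;

        public ChainedHashTable(int bucketCount = 1 << 15)
        {
            buckets = new List<KeyValuePair<TKey, TValue>>[bucketCount];
        }

        public void Add(TKey key, TValue value)
        {
            int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
            (buckets[i] ??= new List<KeyValuePair<TKey, TValue>>())
                .Add(new KeyValuePair<TKey, TValue>(key, value));
        }

        // Returns every value stored under an equal key, oldest first.
        public IEnumerable<TValue> GetAll(TKey key)
        {
            int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
            if (buckets[i] == null) yield break;
            foreach (var entry in buckets[i])
                if (EqualityComparer<TKey>.Default.Equals(entry.Key, key))
                    yield return entry.Value;
        }
    }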

A DEFLATE implementation needs to store a bunch of (key, data) items in no particular order, and rapidly look up a list of all the items with a given key. In this case, the key is 3 consecutive bytes of uncompressed plaintext, and the data is some sort of pointer or offset to where that 3-byte substring occurs in the plaintext.
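
A sketch of how such a table could back the match search in C# (the names, the packed-int key, and the plain linear chain walk are assumptions of this sketch, not something the RFC prescribes): positions are recorded per 3-byte key, and the chain for the current key is compared against the input to pick the longest match.

    using System.Collections.Generic;

    // Sketch: for every 3-byte key, remember the positions where it occurred,
    // then walk that chain and keep the longest match. Names are illustrative.
    class MatchFinder
    {
        private readonly Dictionary<int, List<int>> chains = new Dictionary<int, List<int>>();

        // The caller must guarantee that at least 3 bytes remain at 'pos'.
        private static int Key(byte[] data, int pos) =>
            (data[pos] << 16) | (data[pos + 1] << 8) | data[pos + 2];

        public void InsertPosition(byte[] data, int pos)
        {
            int key = Key(data, pos);
            if (!chains.TryGetValue(key, out var positions))
                chains[key] = positions = new List<int>();
            positions.Add(pos);
        }

        // Returns (position, length) of the longest earlier match; length 0 means
        // the chain was empty and the compressor should emit a literal byte.
        public (int Position, int Length) FindLongestMatch(byte[] data, int pos, int maxLength = 258)
        {
            int bestPos = -1, bestLen = 0;
            if (!chains.TryGetValue(Key(data, pos), out var positions))
                return (bestPos, bestLen);

            foreach (int candidate in positions)          // compare every string on the chain
            {
                int len = 0;
                while (len < maxLength && pos + len < data.Length
                       && data[candidate + len] == data[pos + len])
                    len++;
                if (len > bestLen) { bestLen = len; bestPos = candidate; }
            }
            return (bestPos, bestLen);
        }
    }

A real implementation would also drop chain entries that have slid out of the 32 KiB window and cap how many entries it compares per lookup, which is the usual speed/ratio trade-off.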

Many hashtable/dictionary implementations store both the key and the data for every item. It's not necessary to store the key in the table for DEFLATE, but it doesn't hurt anything other than using slightly more memory during compression.
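
One well-known layout that stores neither the keys nor separate node objects is the pair of arrays used by zlib: head[h] holds the most recent position whose 3-byte hash is h, and prev[] links each position to the previous one with the same hash. A rough C# sketch of that idea (the sizes and names here are assumptions of the sketch):

    using System;
    using System.Collections.Generic;

    // zlib-style chain layout: the 3-byte keys themselves are never stored.
    // head[h]        = most recent window position whose hash is h
    // prev[p & mask] = previous position with the same hash as position p
    class HashChains
    {
        const int WindowSize = 1 << 15;     // 32 KiB DEFLATE window
        const int HashSize = 1 << 15;

        private readonly int[] head = new int[HashSize];
        private readonly int[] prev = new int[WindowSize];

        public HashChains() => Array.Fill(head, -1);   // -1 marks an empty chain

        public void Insert(int hash, int pos)
        {
            prev[pos & (WindowSize - 1)] = head[hash];  // link to the older occurrence
            head[hash] = pos;                           // newest occurrence goes in front
        }

        // Walk the chain for 'hash', newest first. A real implementation must also
        // stop at positions that have slid out of the 32 KiB window.
        public IEnumerable<int> Chain(int hash)
        {
            for (int p = head[hash]; p >= 0; p = prev[p & (WindowSize - 1)])
                yield return p;
        }
    }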

Some hashtable/dictionary implementations, such as the C++ STL unordered_map, insist that every (key, data) item they store must have a unique key. When you try to store another (key, data) item with the same key as some older item already in the table, these implementations delete the old item and replace it with the new item. That does hurt -- if you accidentally use the C++ STL unordered_map or a similar implementation, your compressed file will be larger than if you had used a more appropriate library such as the C++ STL hash_multimap. Such an error may be difficult to detect, since the resulting (unnecessarily large) compressed files can be correctly decompressed by any standard DEFLATE decompressor to a file bit-for-bit identical to the original file. A few implementations of DEFLATE and other compression algorithms deliberately use such an implementation, sacrificing compressed file size in order to gain compression speed.
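
The same pitfall is easy to reproduce in C#: a plain Dictionary<int, int> behaves like the unique-key map described above and keeps only the most recent position per key, whereas a Dictionary<int, List<int>> keeps every position searchable. A hypothetical illustration:

    using System.Collections.Generic;

    class PositionIndex
    {
        // Unique-key map: storing a new position silently discards the older one,
        // so older (possibly longer) matches can no longer be found.
        private readonly Dictionary<int, int> lastOnly = new Dictionary<int, int>();

        // Chained map: every recorded position stays searchable.
        private readonly Dictionary<int, List<int>> all = new Dictionary<int, List<int>>();

        public void Record(int key, int pos)
        {
            lastOnly[key] = pos;                     // overwrites any earlier entry

            if (!all.TryGetValue(key, out var list))
                all[key] = list = new List<int>();
            list.Add(pos);                           // appends, keeps the whole chain
        }
    }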

As Nick Johnson said, the default hash function used in your standard "hashtable" or "dictionary" implementation is probably more than adequate.
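
For 3-byte keys there are only 2^24 possible values, so even very simple hashes work well. Two illustrative options (neither is mandated by the RFC): pack the three bytes into an integer and keep as many bits as the table needs, or use a zlib-style rolling hash that folds in one new byte per position.

    // Two illustrative hash functions for 3-byte sequences.
    static class DeflateHash
    {
        // 1) Pack the three bytes into 24 bits and keep the top 'tableBits' bits.
        public static int Hash24(byte a, byte b, byte c, int tableBits) =>
            ((a << 16) | (b << 8) | c) >> (24 - tableBits);

        // 2) Rolling, zlib-style: shift one new byte into the running hash, so
        //    advancing through the input updates the hash in O(1) per position.
        public const int HashBits = 15;
        public const int HashShift = 5;   // three 5-bit shifts cover the 15-bit hash
        public static int UpdateHash(int h, byte next) =>
            ((h << HashShift) ^ next) & ((1 << HashBits) - 1);
    }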

http://en.wikipedia.org/wiki/Hashtable#Separate_chaining
