Shrink string encoding algorithm


Problem description

How do we shrink/encode a 20-letter string to 6 letters? I found a few algorithms that address data compression, such as RLE, arithmetic coding, and universal codes, but none of them guarantees 6 letters.

The original string can contain the characters A-Z (upper case), 0-9, and a dash.

Solution

If your goal is to losslessly compress or hash a random input string of 20 characters (each character could be [A-Z], [0-9] or -) to an output string of 6 characters, it's theoretically impossible.

In information theory, given a discrete random variable X with possible values {x1, ..., xn}, the Shannon entropy H(X) is defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

where p(xi) is the probability of X = xi. In your case, X is a string of 20 characters, each drawn from 37 possible characters, so n = 37^20. Supposing the 37 characters are equally likely at every position (that is, the input string is random), then p(xi) = 1/37^20. So the Shannon entropy of the input is:

$$H(X) = -\sum_{i=1}^{37^{20}} \frac{1}{37^{20}} \log_2 \frac{1}{37^{20}} = 20 \log_2 37 \approx 104.19 \text{ bits}$$

A char on a common computer holds 8 bits, so 6 chars hold 48 bits. There is no way to fit about 104 bits of information into 6 chars. You need at least 14 chars of 8 bits each (15 if each char is limited to 7-bit ASCII) to hold it.
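The entropy figure and the minimum output length can be checked with a few lines of Python. This is only a sketch: the names `ALPHABET`, `entropy_bits`, `min_chars`, `pack` and `unpack` are made up for illustration, and the packing scheme (reading the string as a base-37 number) is one possible lossless encoding, not something the question prescribes.

```python
import math

# Assumed alphabet: A-Z, 0-9 and the dash, 37 symbols in total.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"
LENGTH = 20

def entropy_bits(alphabet_size: int, length: int) -> float:
    """Shannon entropy, in bits, of a uniformly random string."""
    return length * math.log2(alphabet_size)

def min_chars(bits: float, bits_per_char: int) -> int:
    """Smallest number of output chars that can hold `bits` bits."""
    return math.ceil(bits / bits_per_char)

def pack(s: str) -> bytes:
    """Losslessly pack a 20-character string into 14 bytes (base-37 number)."""
    n = 0
    for ch in s:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n.to_bytes(14, "big")

def unpack(b: bytes) -> str:
    """Inverse of pack(): recover the original 20-character string."""
    n = int.from_bytes(b, "big")
    chars = []
    for _ in range(LENGTH):
        n, r = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

bits = entropy_bits(len(ALPHABET), LENGTH)
print(round(bits, 2))                          # 104.19
print(min_chars(bits, 8), min_chars(bits, 7))  # 14 15
```

With 8-bit output chars the lossless floor is 14; with 7-bit ASCII it is 15. Either way it is far above 6.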


If you do allow loss and have to hash the 20 chars into 6 chars, then you are trying to hash 37^20 values to 128^6 keys (taking each output char as 7-bit ASCII). It can be done, but you would get plenty of hash collisions.

In your case, supposing you hash them as uniformly as possible (anything less uniform would be worse), for each input value there would be on average about 37^20 / 128^6 ≈ 5.26 × 10^18 other input values sharing the same hash key with it. By a birthday attack, we could expect to find a collision within roughly sqrt(π/2 · 128^6) ≈ 2.6 million trials. That could be done in well under 10 seconds on a common laptop. So I don't think this would be a safe hash.
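As a rough arithmetic check, the following sketch computes the average number of inputs per key and the birthday estimate. It assumes 7-bit output chars (128^6 keys) and uses the standard approximation sqrt(πN/2) for the expected number of trials before the first collision among N equally likely keys. The variable names are illustrative only.

```python
import math

inputs = 37 ** 20   # distinct 20-character input strings
keys = 128 ** 6     # distinct 6-character keys, assuming 7 bits per char

# Average number of inputs that land on each key under a perfectly uniform hash.
per_key = inputs / keys

# Birthday approximation: expected trials until the first collision.
birthday = math.sqrt(math.pi * keys / 2)

print(f"{per_key:.3g}")   # 5.26e+18
print(f"{birthday:.3g}")  # 2.63e+06
```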

However, if you insist on doing that, you might want to read "Hash function algorithms", which lists a lot of algorithms to choose from. Good luck!
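For illustration, one way to get a lossy 6-character code is to truncate a general-purpose hash. The sketch below is an assumption, not a method from the answer: it uses SHA-256 from Python's `hashlib` and maps the digest onto the same 37-character alphabet as the input. The function name `short_code` is hypothetical. Restricting the output to that alphabet leaves only 37^6 (about 2.6 billion) distinct codes, so collisions become even more likely than with 128^6 keys.

```python
import hashlib

# Assumed output alphabet: the same 37 symbols allowed in the input.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"

def short_code(s: str, length: int = 6) -> str:
    """Map s to a fixed-length code over ALPHABET. Lossy: collisions are possible."""
    digest = hashlib.sha256(s.encode("ascii")).digest()
    n = int.from_bytes(digest, "big")
    chars = []
    for _ in range(length):
        # Peel off one base-37 digit of the digest per output character.
        n, r = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(chars)

code = short_code("ABC-123-XYZ-456-7890")
print(len(code))  # 6
```

Because the code cannot be reversed, anyone who needs the original string back would have to keep a lookup table from code to input and decide what to do when two inputs map to the same code.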
