Complexity compression string
Question
I have a program that generates a large theoretical database of strings (each 104 characters long), and its output is measured in petabytes. I don't have that much computing power, so I would like to filter the low-complexity strings out of the database.
My grammar is a modified form of the English alphabet with no numerical characters. I read about Kolmogorov complexity and how it is theoretically impossible to compute, but I just need something basic in C# using compression.
Using these two links:
- How to measure complexity of a string?
- How to determine size of string, and compress it
I came up with this:
MemoryStream ms = new MemoryStream();
GZipStream gzip2 = new GZipStream(ms, CompressionMode.Compress, true);
byte[] raw = Encoding.UTF8.GetBytes(element);
gzip2.Write(raw, 0, raw.Length);
gzip2.Close();
byte[] zipped = ms.ToArray(); // as a BLOB
string smallstring = Convert.ToString(zipped); // as a string
// store zipped or base64
byte[] raw2 = Encoding.UTF8.GetBytes(smallstring);
int startsize = raw.Length;
int finishsize = raw2.Length;
double percent = Convert.ToDouble(finishsize) / Convert.ToDouble(startsize);
if (percent > .75)
{
///output
}
My first element is:
HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH
and it compresses to a finishsize of 13 characters, but this other character set
mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdtttpiqsmmcqylarvlveddeimqgfirafrplprhlwylldlkqmeepcrf
also evaluates to 13. There is a bug, but I don't know how to fix it.
Recommended answer
Your bug is in the following part, where you convert the array into a string:
byte[] zipped = ms.ToArray(); // as a BLOB
string smallstring = Convert.ToString(zipped); // as a string
// store zipped or base64
byte[] raw2 = Encoding.UTF8.GetBytes(smallstring);
Calling Convert.ToString() on an array does not encode its contents. It returns the type name as debugging output, in this case the string System.Byte[]. That string is 13 characters long, which is why every input evaluates to 13 regardless of how well it compresses. (See the following example on ideone.)
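The behaviour can be checked in isolation. The snippet below is a minimal sketch; the class name ToStringDemo and the method Describe are made up for illustration and are not part of the question's code.

```csharp
using System;

// Hypothetical demo class, not part of the original question.
public static class ToStringDemo
{
    // Convert.ToString(object) falls back to the object's ToString(),
    // which for an array yields the type name, not the array's contents.
    public static string Describe(byte[] data)
    {
        return Convert.ToString(data);
    }

    public static void Main()
    {
        string text = Describe(new byte[] { 1, 2, 3 });
        Console.WriteLine(text);        // prints "System.Byte[]"
        Console.WriteLine(text.Length); // prints 13
    }
}
```

If a string form of the compressed bytes is really needed (the "store zipped or base64" comment suggests it), Convert.ToBase64String(zipped) encodes the actual data, although the encoded length is then larger than the byte count.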
You should compare the lengths of the uncompressed and compressed byte arrays directly:
int startsize = raw.Length;
int finishsize = zipped.Length;
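Putting the fix together, one possible shape for the whole measurement is sketched below. The class name ComplexityFilter and the method CompressionRatio are assumptions introduced for this example; the stream handling mirrors the code in the question.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper class, named here only for illustration.
public static class ComplexityFilter
{
    // Returns compressed size divided by original size, both in bytes.
    // A lower ratio suggests a more repetitive (lower-complexity) string.
    public static double CompressionRatio(string element)
    {
        byte[] raw = Encoding.UTF8.GetBytes(element);
        if (raw.Length == 0)
        {
            throw new ArgumentException("element must not be empty", nameof(element));
        }

        using (var ms = new MemoryStream())
        {
            // leaveOpen: true keeps ms usable after the GZipStream is disposed;
            // disposing the GZipStream flushes the remaining compressed data.
            using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }

            byte[] zipped = ms.ToArray();
            return (double)zipped.Length / raw.Length;
        }
    }

    public static void Main()
    {
        string element = "HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH";
        Console.WriteLine(CompressionRatio(element));
    }
}
```

Note that the gzip format adds a fixed header and trailer (about 18 bytes in total), which is a noticeable fraction of a 104-character input, so the 0.75 threshold from the question may need tuning against a sample of real strings.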