Is there a way to store gzip's dictionary from a file?

Question

I've been doing some research on compression-based text classification, and I'm trying to figure out a way of storing the dictionary built by the encoder (on a training file) so that it can be applied 'statically' to a test file. Is this at all possible using UNIX's gzip utility?

For example, I have been using two 'class' files, sport.txt and atheism.txt. I want to run compression on both of these files and store the dictionaries they used. Next, I want to take a test file (which is unlabelled and could be either atheism or sport) and, by using the prebuilt dictionaries on this test.txt, analyse how well it compresses under each dictionary/model.

Thanks

Recommended answer

Deflate encoders, as in gzip and zlib, do not "build" a dictionary. They simply use the previous 32K bytes as a source for potential matches to the string of bytes starting at the current position. The last 32K bytes are called the "dictionary", but the name is perhaps misleading.
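The sliding-window behaviour can be observed with a small experiment. The following is a minimal sketch using Python's zlib module; the helper name `extra_bytes`, the 1000-byte block, and the two gap sizes are illustrative assumptions, not part of the original answer. It compresses a random block, some random filler, and then the same block again. When the second copy lies within 32K of the first, the encoder can reference the first copy; when the filler pushes it further back, it cannot.

```python
import random
import zlib


def extra_bytes(gap: int, seed: int = 0) -> int:
    """Compressed size of block + filler + block, minus the filler length.

    The filler is random (incompressible), so subtracting its length leaves
    roughly the cost of encoding the two copies of the block.
    """
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(1000))
    filler = bytes(rng.getrandbits(8) for _ in range(gap))
    return len(zlib.compress(block + filler + block))  - gap


near = extra_bytes(1_000)   # second copy is well inside the 32K window
far = extra_bytes(40_000)   # second copy is more than 32K after the first

# The "near" figure should be noticeably smaller, because the second copy
# of the block is encoded as back-references instead of literal bytes.
print("near:", near, "far:", far)
```

In the "far" case the repeated block costs about as much as new data, which is the sense in which the "dictionary" is only the most recent 32K bytes of input.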

You can use zlib to experiment with preset dictionaries. See the deflateSetDictionary() and inflateSetDictionary() functions. In that case, zlib compression is primed with a "dictionary" of up to 32K bytes that effectively precedes the first byte being compressed as a source for matches, but the dictionary itself is not compressed. The priming can only improve the compression of the first 32K bytes. After that, the preset dictionary is too far back to provide matches.
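As a rough illustration of how preset dictionaries could be applied to the classification task in the question, here is a minimal sketch using Python's zlib module, whose `compressobj` and `decompressobj` accept a `zdict` argument corresponding to deflateSetDictionary() and inflateSetDictionary(). The function names `compressed_size` and `classify`, and the short sample texts standing in for sport.txt and atheism.txt, are assumptions made up for this example. Only the last 32K bytes of each training text are used, since that is all a preset dictionary can hold.

```python
import zlib

WINDOW = 32 * 1024  # a preset dictionary cannot usefully exceed the window


def compressed_size(text: bytes, training: bytes) -> int:
    """Size of `text` when deflate is primed with the tail of `training`."""
    comp = zlib.compressobj(level=9, zdict=training[-WINDOW:])
    return len(comp.compress(text) + comp.flush())


def classify(text: bytes, classes: dict) -> tuple:
    """Return the label whose dictionary compresses `text` best, plus all sizes."""
    sizes = {label: compressed_size(text, data) for label, data in classes.items()}
    return min(sizes, key=sizes.get), sizes


# Hypothetical stand-ins for the contents of sport.txt and atheism.txt.
CLASSES = {
    "sport": b"The team won the match after the striker scored a late goal. "
             b"The coach praised the players and the fans cheered in the stadium.",
    "atheism": b"Atheism is the absence of belief in gods. Many philosophers argue "
               b"about religion, faith, and evidence for the existence of a deity.",
}
TEST = b"The striker scored a late goal and the fans cheered as the team won the match."

label, sizes = classify(TEST, CLASSES)
print(label, sizes)

# Decompression needs the same dictionary (the inflateSetDictionary() side).
comp = zlib.compressobj(level=9, zdict=CLASSES["sport"])
packed = comp.compress(TEST) + comp.flush()
restored = zlib.decompressobj(zdict=CLASSES["sport"]).decompress(packed)
```

Because the dictionary only helps with the first 32K bytes of input, a scheme like this is best suited to short test documents; for longer ones, a common alternative is to compare the compressed size of the training text alone with that of the training text concatenated with the test text.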

gzip provides no support for preset dictionaries.
