GZipStream from MemoryStream only returns a few hundred bytes


Question


I am trying to download a .gz file of a few hundred MBs, and turn it into a very long string in C#.

using (var memstream = new MemoryStream(new WebClient().DownloadData(url)))
using (GZipStream gs = new GZipStream(memstream, CompressionMode.Decompress))
using (var outmemstream = new MemoryStream())
{
    gs.CopyTo(outmemstream);
    string t = Encoding.UTF8.GetString(outmemstream.ToArray());
    Console.WriteLine(t);
}


memstream has a length of 283063949. The program lingers for about 15 seconds on the line where it is initialized, and my network is floored during it, which makes sense.


outmemstream has a length of only 548.


The first lines of the zipped document are written to the command line. They are not garbled. I'm not sure how to get the rest.

Answer


The .NET GZipStream unpacks the first 548 bytes of the plain text, which is all of the first record in the file. 7Zip extracts the whole file to a 1.2GB output file, but it is plain text (about 1.3 million lines worth) with no record separators, and when I test the file in 7Zip it reports 1,441 bytes.


I checked a few things and couldn't find a single compression library that would unpack this thing directly.


After a bit of casting about in the file I found that 1,441 bytes is the value of ISIZE which is normally the last 4 bytes of the gzip file, part of an 8-byte footer record that is appended to the compressed data chunks.
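As a quick sanity check you can read that footer field yourself. A minimal sketch (ReadGZipIsize is a hypothetical helper, not from the original answer; it assumes the whole file fits in memory and a little-endian platform, which matches how gzip stores ISIZE):

public static uint ReadGZipIsize(string filename)
{
    byte[] data = File.ReadAllBytes(filename);
    // ISIZE is the last 4 bytes of a gzip stream: the uncompressed size modulo 2^32
    // of the final member only, which is why it reads 1,441 here instead of ~1.2 GB.
    return BitConverter.ToUInt32(data, data.Length - 4);
}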


It turns out that what you have is a big set of .gz files concatenated together. And while that's a complete pain in the butt, there are a few ways you can approach this.


The first is to scan the compressed file for the gzip header signature bytes: 0x1F and 0x8B. When you locate these you will (usually) have the start of each .gz file in the stream. You can build a list of offsets in the file and then extract each chunk of the file and decompress it.
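A rough sketch of that scan (FindGZipOffsets is a hypothetical helper, not from the original answer; it assumes the compressed data is already in memory as a byte array):

public static List<long> FindGZipOffsets(byte[] compressed)
{
    var offsets = new List<long>();
    for (int i = 0; i + 2 < compressed.Length; i++)
    {
        // 0x1F 0x8B is the gzip magic number; 0x08 (deflate) is the compression
        // method byte that follows it in practice. Checking all three reduces
        // false positives, since the two magic bytes can occur by chance inside
        // compressed data.
        if (compressed[i] == 0x1F && compressed[i + 1] == 0x8B && compressed[i + 2] == 0x08)
            offsets.Add(i);
    }
    return offsets;
}

If a candidate offset turns out not to be a real member start (decompression from it fails), just skip to the next one.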


Another option is to use a library that will report the number of bytes consumed from the input stream. Since almost all decompressors use buffering of some sort you will find that the input stream will move much further than the number of bytes consumed, so this is difficult to guess at directly. The DotNetZip streams however will give you the actual consumed input bytes, which you can use to figure out the next starting position. This will allow you to process the file as a stream and extract each file individually.


Either way, not fast.


Here's a method for the second option, using the DotNetZip library:

public static IEnumerable<byte[]> UnpackCompositeFile(string filename)
{
    using (var fstream = File.OpenRead(filename))
    {
        long offset = 0;
        while (offset < fstream.Length)
        {
            fstream.Position = offset;  // seek to the start of the next gzip member
            byte[] bytes = null;
            using (var ms = new MemoryStream())
            using (var unpack = new Ionic.Zlib.GZipStream(fstream, Ionic.Zlib.CompressionMode.Decompress, true))
            {
                unpack.CopyTo(ms);
                bytes = ms.ToArray();
                // Total compressed bytes read, plus 10 for GZip header, plus 8 for GZip footer
                offset += unpack.TotalIn + 18;
            }
            yield return bytes;
        }
    }
}


It's ugly and not fast (took me about 48 seconds to decompress the whole file) but it appears to work. Each byte[] output represents a single compressed file in the stream. These can be turned into strings with System.Text.Encoding.UTF8.GetString(...) and then parsed to extract the meaning.
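A caller might consume it along these lines (a sketch; the filename is a placeholder):

foreach (byte[] record in UnpackCompositeFile("yourfile.warc.gz"))
{
    string text = System.Text.Encoding.UTF8.GetString(record);
    // parse the WARC headers / JSON payload from 'text' here
}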


The last item in the file looks like this:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://zverek-shop.ru/dljasobak/ruletka_sobaki/ruletka-tros_standard_5_m_dlya_sobak_do_20_kg
WARC-Date: 2017-11-25T14:16:01Z
WARC-Record-ID: <urn:uuid:e19ef645-b057-4305-819f-7be2687c3f19>
WARC-Refers-To: <urn:uuid:df5de410-d4af-45ce-b545-c699e535765f>
Content-Type: application/json
Content-Length: 1075

{"Container":{"Filename":"CC-MAIN-20171117170336-20171117190336-00002.warc.gz","Compressed":true,"Offset":"904209205","Gzip-Metadata":{"Inflated-Length":"463","Footer-Length":"8","Inflated-CRC":"1610542914","Deflate-Length":"335","Header-Length":"10"}},"Envelope":{"Format":"WARC","WARC-Header-Length":"438","Actual-Content-Length":"21","WARC-Header-Metadata":{"WARC-Target-URI":"https://zverek-shop.ru/dljasobak/ruletka_sobaki/ruletka-tros_standard_5_m_dlya_sobak_do_20_kg","WARC-Warcinfo-ID":"<urn:uuid:283e4862-166e-424c-b8fd-023bfb4f18f2>","WARC-Concurrent-To":"<urn:uuid:ca594c00-269b-4690-b514-f2bfc39c2d69>","WARC-Date":"2017-11-17T17:43:04Z","Content-Length":"21","WARC-Record-ID":"<urn:uuid:df5de410-d4af-45ce-b545-c699e535765f>","WARC-Type":"metadata","Content-Type":"application/warc-fields"},"Block-Digest":"sha1:4SKCIFKJX5QWLVICLR5Y2BYE6IBVMO3Z","Payload-Metadata":{"Actual-Content-Type":"application/metadata-fields","WARC-Metadata-Metadata":{"Metadata-Records":[{"Value":"1140","Name":"fetchTimeMs"}]},"Actual-Content-Length":"21","Trailing-Slop-Length":"0"}}}


This is the record that occupies 1,441 bytes, including the two blank lines after it.


Just for the sake of completeness...


The TotalIn property returns the number of compressed bytes read, not including the GZip header and footer. In the code above I use a constant 18 bytes for the header and footer size, which is the minimum size of these for GZip. While that works for this file, anyone else dealing with concatenated GZip files may find that there is additional data in the header that makes it larger, which will stop the above from working.


In this case you have two options:

  • Parse the GZip header directly and use DeflateStream to decompress (see the sketch after this list).
  • Scan for the GZip signature bytes starting at TotalIn + 18 bytes.
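For the first option, here is a minimal sketch of skipping a gzip member header by hand (SkipGZipHeader is a hypothetical helper, not from the original answer; it assumes the stream is positioned at the 0x1F 0x8B magic bytes):

public static void SkipGZipHeader(Stream s)
{
    int id1 = s.ReadByte(), id2 = s.ReadByte(), cm = s.ReadByte();
    if (id1 != 0x1F || id2 != 0x8B || cm != 8)
        throw new InvalidDataException("Not a gzip (deflate) member header.");

    int flags = s.ReadByte();
    s.Seek(6, SeekOrigin.Current);              // MTIME (4 bytes), XFL, OS

    if ((flags & 0x04) != 0)                    // FEXTRA: 2-byte length + payload
    {
        int xlen = s.ReadByte() | (s.ReadByte() << 8);
        s.Seek(xlen, SeekOrigin.Current);
    }
    if ((flags & 0x08) != 0)                    // FNAME: zero-terminated string
        while (s.ReadByte() > 0) { }
    if ((flags & 0x10) != 0)                    // FCOMMENT: zero-terminated string
        while (s.ReadByte() > 0) { }
    if ((flags & 0x02) != 0)                    // FHCRC: 2-byte header CRC
        s.Seek(2, SeekOrigin.Current);

    // The stream now points at the raw deflate data. Wrap it in a DeflateStream
    // (with leaveOpen set) to decompress; the next member starts right after the
    // deflate data plus the 8-byte CRC32/ISIZE footer.
}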


Either should work without slowing you down too much. Since buffering is happening in the decompression code you're going to have to seek the stream backwards after each segment, so reading some additional bytes doesn't slow you down too much.
