Can the Encoding API decode a Stream/noncontinuous bytes?

Problem description

Usually we can get a string from a byte[] using something like:

var result = Encoding.UTF8.GetString(bytes);

However, I have this problem: my input is an IEnumerable<byte[]> bytes (the implementation can be any structure of my choice). A character is not guaranteed to lie within a single byte[] (for example, a 2-byte UTF-8 character can have its first byte in bytes[1][length - 1] and its second byte in bytes[2][0]).

Is there any way to decode them without merging/copying all the arrays together? UTF-8 is the main focus, but support for other encodings would be better. If there is no other solution, I think implementing my own UTF-8 reader would be the way to go.

I planned to stream them using a MemoryStream; however, Encoding cannot work on a Stream, only on a byte[]. If merged together, the resulting array may be very large (the List<byte[]> already holds up to 4GB).

I am using .NET Standard 2.0. I wish I could use 2.1 (it is not released yet) and Span<byte[]>, which would be perfect for my case!

Recommended answer

The Encoding class can't deal with that directly, but the Decoder returned from Encoding.GetDecoder() can (indeed, that's its entire reason for existing). StreamReader uses a Decoder internally.
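To make the difference concrete, here is a minimal sketch that splits the two UTF-8 bytes of "é" (0xC3 0xA9) across two arrays. The class name SplitCharDemo and its members are made up for illustration, and the 16-element char buffer is just an assumption that is large enough for this tiny input.

```csharp
using System;
using System.Text;

static class SplitCharDemo
{
    // "é" is 0xC3 0xA9 in UTF-8; the two bytes deliberately land in different arrays.
    public static readonly byte[][] Segments =
    {
        new byte[] { (byte)'c', (byte)'a', (byte)'f', 0xC3 },
        new byte[] { 0xA9, (byte)'!' },
    };

    // Stateless: every array is decoded on its own, so the split character is lost.
    public static string DecodeStateless()
    {
        var sb = new StringBuilder();
        foreach (var segment in Segments)
            sb.Append(Encoding.UTF8.GetString(segment));
        return sb.ToString();
    }

    // Stateful: the Decoder holds on to the dangling 0xC3 until the next call supplies 0xA9.
    public static string DecodeStateful()
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        var chars = new char[16]; // assumption: big enough for this demo input
        foreach (var segment in Segments)
        {
            int n = decoder.GetChars(segment, 0, segment.Length, chars, 0);
            sb.Append(chars, 0, n);
        }
        return sb.ToString();
    }
}
```

With the default Encoding.UTF8 instance, DecodeStateless() should come back with U+FFFD replacement characters where the "é" was, while DecodeStateful() reassembles the character across the boundary.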

It's slightly fiddly to work with though, as it needs to populate a char[], rather than returning a string (Encoding.GetString() and StreamReader normally handle the business of populating the char[]).

The problem with using a MemoryStream is that you're copying all of the bytes from one array to another, for no gain. If all of your buffers are the same length, you can do this:

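// Needs: using System.Diagnostics; using System.Text;
var decoder = Encoding.UTF8.GetDecoder();
// bufferSize is the common length of every byte[] in bytes.
// GetMaxCharCount lives on Encoding, not on Decoder.
// +1 in case it includes a work-in-progress char from the previous buffer
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(bufferSize) + 1];
foreach (var byteSegment in bytes)
{
    // The decoder keeps any incomplete trailing sequence for the next call
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}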

If the buffers have different lengths:

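// Needs: using System; using System.Diagnostics; using System.Text;
var decoder = Encoding.UTF8.GetDecoder();
char[] chars = Array.Empty<char>();
foreach (var byteSegment in bytes)
{
    // Size for this segment's own length (GetMaxCharCount lives on Encoding, not on Decoder).
    // +1 in case it includes a work-in-progress char from the previous buffer
    int charsMinSize = Encoding.UTF8.GetMaxCharCount(byteSegment.Length) + 1;
    // Grow the shared char buffer only when this segment needs more room
    if (chars.Length < charsMinSize)
        chars = new char[charsMinSize];
    int numChars = decoder.GetChars(byteSegment, 0, byteSegment.Length, chars, 0);
    Debug.WriteLine(new string(chars, 0, numChars));
}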
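If you want something reusable for any Encoding, the loop above could be wrapped up roughly like this. This is only a sketch: SegmentDecoder and Decode are hypothetical names, and it yields one string per segment so that nothing is ever merged into a single large buffer. It also makes one final call with flush set to true, which the snippets above skip, so that a sequence truncated at the very end of the input is reported through the encoding's fallback instead of being silently dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SegmentDecoder
{
    // Hypothetical helper: lazily decodes each byte[] in order, carrying decoder state across segments.
    public static IEnumerable<string> Decode(IEnumerable<byte[]> segments, Encoding encoding)
    {
        var decoder = encoding.GetDecoder();
        char[] chars = Array.Empty<char>();

        foreach (var segment in segments)
        {
            // Worst-case char count for this segment; grow the shared buffer only when needed.
            int needed = encoding.GetMaxCharCount(segment.Length);
            if (chars.Length < needed)
                chars = new char[needed];

            // flush: false keeps any incomplete trailing bytes for the next segment.
            int n = decoder.GetChars(segment, 0, segment.Length, chars, 0, false);
            if (n > 0)
                yield return new string(chars, 0, n);
        }

        // Final empty call with flush: true so leftover bytes go through the decoder fallback.
        int tailSize = encoding.GetMaxCharCount(0);
        if (chars.Length < tailSize)
            chars = new char[tailSize];
        int tail = decoder.GetChars(Array.Empty<byte>(), 0, 0, chars, 0, true);
        if (tail > 0)
            yield return new string(chars, 0, tail);
    }
}
```

Each yielded string can be written straight to a TextWriter or processed and discarded; string.Concat(SegmentDecoder.Decode(bytes, Encoding.UTF8)) would rebuild the whole text, but for multi-gigabyte input that brings back the very allocation you were trying to avoid.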
