Encoding.UTF8.GetString doesn't take into account the Preamble/BOM

Question

In .NET, I'm trying to use the Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.

It looks like this method ignores the BOM (byte order mark), which might be part of a legitimate binary representation of a UTF-8 string, and treats it as a character.

I know I can use a TextReader to digest the BOM as needed, but I thought the GetString method was meant to be some kind of shortcut that makes our code shorter.

Am I missing something? Is this behavior intentional?

Here is code that reproduces it:

using System;
using System.IO;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string s1 = "abc";

        // UTF8Encoding(true) makes StreamWriter emit the preamble (BOM) first.
        byte[] abcWithBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
        }

        // UTF8Encoding(false) suppresses the preamble.
        byte[] abcWithoutBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithoutBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
        }

        var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
        Console.WriteLine(restore1.Length); // 3
        Console.WriteLine(restore1); // abc

        // GetString decodes the three BOM bytes as one extra character.
        var restore2 = Encoding.UTF8.GetString(abcWithBom);
        Console.WriteLine(restore2.Length); // 4 (!)
        Console.WriteLine(restore2); // ?abc
    }

    // Formats each byte as lowercase hex, separated by commas.
    private static string FormatArray(byte[] bytes1)
    {
        return string.Join(", ", from b in bytes1 select b.ToString("x"));
    }
}

Recommended answer

"It looks like this method ignores the BOM (byte order mark), which might be part of a legitimate binary representation of a UTF-8 string, and treats it as a character."

It doesn't look like it "ignores" it at all; it faithfully converts it to the BOM character (U+FEFF). That's what it is, after all.

If you want your code to ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.
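Both options mentioned above can be sketched roughly as follows. This is only an illustration, not something from the original answer: the class name BomAwareDecoding and both method names are hypothetical, and the first helper assumes the input is UTF-8 and only checks for the UTF-8 preamble returned by Encoding.UTF8.GetPreamble().

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class; not part of .NET.
public static class BomAwareDecoding
{
    // Option 1: skip a leading UTF-8 BOM by hand, then call GetString on the rest.
    public static string GetStringSkippingBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF

        // Check whether the array starts with the preamble bytes.
        bool hasBom = bytes.Length >= preamble.Length;
        for (int i = 0; hasBom && i < preamble.Length; i++)
        {
            if (bytes[i] != preamble[i])
            {
                hasBom = false;
            }
        }

        int offset = hasBom ? preamble.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Option 2: let StreamReader consume the BOM while reading.
    public static string ReadWithStreamReader(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var reader = new StreamReader(ms, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the byte array ef, bb, bf, 61, 62, 63 from the question, both methods should return the three-character string "abc"; with 61, 62, 63 they return "abc" as well, since there is nothing to skip.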

Note that if you either use Encoding.GetBytes followed by Encoding.GetString, or use StreamWriter followed by StreamReader, each pairing is consistent: the stream pair writes the BOM (when the encoding has a preamble) and then swallows it, while the Encoding pair never produces one. It's only when you mix a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.
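The three combinations described above can be compared side by side. The following is a minimal sketch under stated assumptions: the class name BomRoundTrip and its method names are made up for illustration, and the writer is configured with new UTF8Encoding(true) as in the question.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; names are illustrative only.
public static class BomRoundTrip
{
    // Encoding.GetBytes does not emit the preamble, so GetString sees no BOM.
    public static string ViaEncodingOnly(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes);
    }

    // StreamWriter with UTF8Encoding(true) writes the preamble before the text.
    public static byte[] WriteWithBom(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new StreamWriter(ms, new UTF8Encoding(true)))
            {
                writer.Write(text);
            }
            // ToArray still works after the stream has been closed.
            return ms.ToArray();
        }
    }

    // StreamWriter followed by StreamReader: the BOM is written, then consumed.
    public static string ViaStreams(string text)
    {
        using (var ms = new MemoryStream(WriteWithBom(text)))
        using (var reader = new StreamReader(ms, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // StreamWriter followed by a direct GetString: the BOM survives as U+FEFF.
    public static string Mixed(string text)
    {
        return Encoding.UTF8.GetString(WriteWithBom(text));
    }
}
```

For the input "abc", ViaEncodingOnly and ViaStreams both return a string of length 3, while Mixed returns a string of length 4 whose first character is '\uFEFF', matching the output shown in the question.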
