Encoding.UTF8.GetString doesn't take into account the Preamble/BOM


Problem description

In .NET, I'm trying to use the Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.

I know I can use a TextReader to digest the BOM as needed, but I thought that the GetString method should be some kind of a macro that makes our code shorter.

Am I missing something? Is this behavior intentional?

Here is code that reproduces it:

using System;
using System.IO;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string s1 = "abc";

        // Write "abc" through a StreamWriter whose encoding emits the UTF-8 BOM.
        byte[] abcWithBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
        }

        // Same text, but with an encoding that does not emit the BOM.
        byte[] abcWithoutBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithoutBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
        }

        var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
        Console.WriteLine(restore1.Length); // 3
        Console.WriteLine(restore1); // abc

        var restore2 = Encoding.UTF8.GetString(abcWithBom);
        Console.WriteLine(restore2.Length); // 4 (!)
        Console.WriteLine(restore2); // ?abc (the leading character is U+FEFF)
    }

    // Formats each byte as lowercase hex, separated by ", ".
    private static string FormatArray(byte[] bytes1)
    {
        return string.Join(", ", from b in bytes1 select b.ToString("x"));
    }
}

Solution

"It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character."

It doesn't look like it "ignores" it at all - it faithfully converts it to the BOM character. That's what it is, after all.

If you want to make your code ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.
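Both options mentioned above can be sketched roughly as follows. This is only an illustration, not the one correct way to do it: the class name BomDemo and the helpers DecodeSkippingBom and DecodeWithStreamReader are hypothetical names made up for this example, and the input is assumed to be a complete UTF-8 byte array held in memory.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: decode UTF-8 bytes, skipping a leading BOM if one is present.
    public static string DecodeSkippingBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        bool hasBom = bytes.Length >= preamble.Length
            && bytes.Take(preamble.Length).SequenceEqual(preamble);
        int offset = hasBom ? preamble.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Hypothetical helper: let StreamReader consume the BOM while reading.
    public static string DecodeWithStreamReader(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var reader = new StreamReader(ms, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] abcWithBom = { 0xEF, 0xBB, 0xBF, 0x61, 0x62, 0x63 };

        Console.WriteLine(Encoding.UTF8.GetString(abcWithBom).Length); // 4
        Console.WriteLine(DecodeSkippingBom(abcWithBom).Length);       // 3
        Console.WriteLine(DecodeWithStreamReader(abcWithBom).Length);  // 3
    }
}
```

The manual version avoids allocating a stream, while the StreamReader version is shorter and lets the framework handle the BOM check.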

Note that if you use either Encoding.GetBytes followed by Encoding.GetString, or StreamWriter followed by StreamReader, each pair is self-consistent: it either produces the BOM and then swallows it, or never produces it at all. It's only when you mix a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.
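To make the pairing concrete, here is a small sketch that runs the two matched round trips and then the mismatched one. The class name RoundTripDemo and its two helper methods are hypothetical names for this example only; the framework calls are the same ones used in the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper: write text through a StreamWriter that emits the UTF-8 BOM.
    public static byte[] WriteWithStreamWriter(string text)
    {
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(text);
            sw.Flush();
            return ms.ToArray();
        }
    }

    // Hypothetical helper: read the bytes back through a StreamReader.
    public static string ReadWithStreamReader(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var sr = new StreamReader(ms, Encoding.UTF8))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Pair 1: GetBytes then GetString. No BOM is ever produced.
        byte[] plain = Encoding.UTF8.GetBytes("abc");
        Console.WriteLine(plain.Length);                              // 3
        Console.WriteLine(Encoding.UTF8.GetString(plain).Length);     // 3

        // Pair 2: StreamWriter then StreamReader. The BOM is produced, then swallowed.
        byte[] written = WriteWithStreamWriter("abc");
        Console.WriteLine(written.Length);                            // 6
        Console.WriteLine(ReadWithStreamReader(written).Length);      // 3

        // Mismatch: StreamWriter then GetString. The BOM survives as U+FEFF.
        string mixed = Encoding.UTF8.GetString(written);
        Console.WriteLine(mixed.Length);                              // 4
        Console.WriteLine(mixed[0] == '\uFEFF');                      // True
    }
}
```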
