Encoding.UTF8.GetString doesn't take into account the Preamble/BOM
Question
In .NET, I'm trying to use the Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.
It looks like this method ignores the BOM (Byte Order Mark), which can be part of a legitimate binary representation of a UTF-8 string, and treats it as a character.
I know I can use a TextReader to digest the BOM as needed, but I thought the GetString method was meant to be a kind of shortcut that makes our code shorter.
Am I missing something? Is this behavior intentional?
Here is code that reproduces it:
using System;
using System.IO;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string s1 = "abc";

        // Encode through a StreamWriter whose encoding emits the UTF-8 preamble.
        byte[] abcWithBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
        }

        // Same text, but with an encoding that emits no preamble.
        byte[] abcWithoutBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithoutBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
        }

        var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
        Console.WriteLine(restore1.Length); // 3
        Console.WriteLine(restore1); // abc

        var restore2 = Encoding.UTF8.GetString(abcWithBom);
        Console.WriteLine(restore2.Length); // 4 (!)
        Console.WriteLine(restore2); // ?abc (the leading character is U+FEFF)
    }

    // Formats a byte array as comma-separated lowercase hex values.
    private static string FormatArray(byte[] bytes1)
    {
        return string.Join(", ", from b in bytes1 select b.ToString("x"));
    }
}
Recommended answer
"It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character."
It doesn't look like it "ignores" it at all: it faithfully converts it to the BOM character (U+FEFF). That's what it is, after all.
If you want your code to ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.
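Both options mentioned above can be sketched roughly as follows. This is only an illustration, not code from the original answer: the class name BomDecoding and its helper methods are hypothetical, and the second helper assumes the only preamble of interest is the one returned by Encoding.UTF8.GetPreamble().

```csharp
using System;
using System.IO;
using System.Text;

static class BomDecoding
{
    // Hypothetical helper: decode through a StreamReader, which consumes
    // a leading UTF-8 BOM instead of returning it as a character.
    public static string DecodeWithReader(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var reader = new StreamReader(ms, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: skip the preamble by hand, then call GetString
    // on the remaining bytes.
    public static string DecodeSkippingPreamble(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        int offset = StartsWith(bytes, preamble) ? preamble.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Returns true when 'bytes' begins with every byte of 'prefix'.
    private static bool StartsWith(byte[] bytes, byte[] prefix)
    {
        if (bytes.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (bytes[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With the abcWithBom array from the question, either helper should give back a three-character "abc", while input without a BOM passes through unchanged.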
Note that if you either use Encoding.GetBytes followed by Encoding.GetString, or use StreamWriter followed by StreamReader, each pairing is consistent: the first never produces a BOM, and the second produces it and then swallows it. It's only when you mix a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.
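The three combinations can be compared side by side. The sketch below is illustrative only: the BomRoundTrip class and its method names are made up for this example, and it assumes a UTF8Encoding constructed with the BOM enabled, as in the question.

```csharp
using System;
using System.IO;
using System.Text;

static class BomRoundTrip
{
    // Encodes text through a StreamWriter, which writes the encoding's
    // preamble (if any) before the text itself.
    public static byte[] WriteWithStreamWriter(string text, Encoding encoding)
    {
        using (var ms = new MemoryStream())
        {
            using (var sw = new StreamWriter(ms, encoding))
            {
                sw.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return ms.ToArray();
        }
    }

    // Decodes bytes through a StreamReader, which consumes a leading preamble.
    public static string ReadWithStreamReader(byte[] bytes, Encoding encoding)
    {
        using (var sr = new StreamReader(new MemoryStream(bytes), encoding))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        var utf8 = new UTF8Encoding(true);

        // Pair 1: GetBytes never emits the preamble, so GetString sees none.
        byte[] viaGetBytes = utf8.GetBytes("abc");
        Console.WriteLine(viaGetBytes.Length);                  // 3
        Console.WriteLine(utf8.GetString(viaGetBytes).Length);  // 3

        // Pair 2: StreamWriter emits the preamble, StreamReader swallows it.
        byte[] viaWriter = WriteWithStreamWriter("abc", utf8);
        Console.WriteLine(viaWriter.Length);                             // 6
        Console.WriteLine(ReadWithStreamReader(viaWriter, utf8).Length); // 3

        // Mixed: StreamWriter output decoded directly with GetString
        // keeps the BOM as a leading U+FEFF character.
        string mixed = utf8.GetString(viaWriter);
        Console.WriteLine(mixed.Length);         // 4
        Console.WriteLine(mixed[0] == '\uFEFF'); // True
    }
}
```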