Using .NET, how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8


Question

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?

I have tried to use a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating encoding ascii and encoding utf8 and then using Encoding.Convert(ascii, utf8, ascii.GetBytes( asciiString) ) — but the accented characters are being rendered as question marks.

What step am I missing?

Recommended answer

You need to get the proper Encoding object. ASCII is just as it's named: ASCII, meaning that it only supports 7-bit ASCII characters. If what you want to do is convert files, then this is likely easier than dealing with the byte arrays directly.

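// Requires: using System.Text; (for Encoding)
// Decode the input file as ISO 8859-1.
using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
                                       Encoding.GetEncoding("iso-8859-1")))
{
    // StreamWriter has no (string, Encoding) constructor, so pass the append
    // flag explicitly: false overwrites any existing output file.
    // Encoding.UTF8 makes the writer emit a byte order mark (BOM).
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
                                           outFileName, false, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}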
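To see why the ASCII attempt in the question produced question marks, here is a minimal sketch that decodes the same bytes both ways. The class and method names (AsciiVersusLatin1, DecodeBoth) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text;

public static class AsciiVersusLatin1
{
    // Hypothetical helper: decodes the same bytes once as ASCII and once
    // as ISO 8859-1 so the two results can be compared.
    public static (string Ascii, string Latin1) DecodeBoth(byte[] data)
    {
        // ASCII only covers 0x00-0x7F; any higher byte is replaced with '?'.
        string ascii = Encoding.ASCII.GetString(data);

        // ISO 8859-1 maps every byte 0x00-0xFF to the code point of the same value.
        string latin1 = Encoding.GetEncoding("iso-8859-1").GetString(data);

        return (ascii, latin1);
    }
}
```

Calling DecodeBoth with the bytes 0x63 0x61 0x66 0xE9 ("café" in ISO 8859-1) should give "caf?" for the ASCII decode and "café" for the ISO 8859-1 decode. Once the accented character has been replaced with '?', no later conversion to UTF-8 can bring it back.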

However, if you want to have the byte arrays yourself, it's easy enough to do with Encoding.Convert.

byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
    Encoding.UTF8, data);

It's important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.

In the interest of fully exploring the issue, something like this would work:

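// Requires: using System.Text; (for Encoding)
using (System.IO.FileStream input = new System.IO.FileStream(fileName,
                                    System.IO.FileMode.Open, 
                                    System.IO.FileAccess.Read))
{
    // Holds the raw bytes of the file; nothing is decoded at this stage.
    byte[] buffer = new byte[input.Length];

    int readLength = 0;

    // Read may return fewer bytes than requested, so loop until the buffer is full.
    while (readLength < buffer.Length)
    {
        int count = input.Read(buffer, readLength, buffer.Length - readLength);
        if (count == 0) // the file ended early; avoid looping forever
            throw new System.IO.EndOfStreamException();
        readLength += count;
    }

    // Re-encode the ISO 8859-1 bytes as UTF-8 bytes.
    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), 
                       Encoding.UTF8, buffer);

    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
                                         System.IO.FileMode.Create, 
                                         System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}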

In this example, the buffer variable gets filled with the actual data in the file as a byte[], so no conversion is done. Encoding.Convert specifies a source and destination encoding, then stores the converted bytes in the variable named...converted. This is then written to the output file directly.
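As a small illustration of what Encoding.Convert does at the byte level, the sketch below wraps the same call in a helper. The names Latin1ToUtf8 and Convert are chosen for this example only and do not come from the original answer.

```csharp
using System;
using System.Text;

public static class Latin1ToUtf8
{
    // Hypothetical helper: re-encodes ISO 8859-1 bytes as UTF-8 bytes.
    // Encoding.Convert returns only the encoded characters; it adds no BOM.
    public static byte[] Convert(byte[] latin1Bytes)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return Encoding.Convert(latin1, Encoding.UTF8, latin1Bytes);
    }
}
```

For the input bytes 63 61 66 E9 ("café" in ISO 8859-1), the result should be 63 61 66 C3 A9. The ASCII-range bytes pass through unchanged, while the single byte E9 becomes the two-byte UTF-8 sequence C3 A9, so the output can be longer than the input.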

Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you're doing, but the latter example should give you more of a hint as to what's actually going on.
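For files small enough to hold in memory, a shorter variant of the same idea is possible with File.ReadAllText and File.WriteAllText. This is an alternative sketch, not something the original answer proposes, and the class and method names (FileTranscoder, ConvertLatin1FileToUtf8) are hypothetical. It also assumes you want the output without a byte order mark.

```csharp
using System.IO;
using System.Text;

public static class FileTranscoder
{
    // Hypothetical helper: reads a whole ISO 8859-1 file into a string
    // and writes it back out as UTF-8.
    public static void ConvertLatin1FileToUtf8(string inputPath, string outputPath)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Decode every byte of the input as ISO 8859-1.
        string text = File.ReadAllText(inputPath, latin1);

        // UTF8Encoding(false) writes UTF-8 without a BOM;
        // pass Encoding.UTF8 instead if a BOM is wanted.
        File.WriteAllText(outputPath, text, new UTF8Encoding(false));
    }
}
```

Because the whole file is loaded as one string, this is probably best kept for modest file sizes. For very large files, reading and writing in chunks with a StreamReader and StreamWriter would use less memory.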
