Convert CSV file from any type to UTF-8


Question

Hello, I am creating a simple console application in VB.NET to convert a file from any encoding to UTF-8, but I can't figure out how the encoding part works. I know that the source file is in Unicode, but when I convert it to the new format I get junk. Any suggestions? I am not sure whether my code is correct.

Here is my code:

Imports System.IO
Imports System.Text

Module Module1
    Sub Main()
        Console.Write("Please give the filepath (example:c:/tesfile.csv):")
        Dim filepath As String = Console.ReadLine()
        Dim sEncoding As String = DetermineFileType(filepath)
        Dim strContents As String
        Dim strEncodedContents As String
        Dim objReader As StreamReader
        Dim ErrInfo As String
        Dim bString As Byte()
        Try

            'Read the file
            objReader = New StreamReader(filepath)
            'Read untill the end
            strContents = objReader.ReadToEnd()
            'Close The file
            objReader.Close()
            'Write Contents on DOS
            Console.WriteLine(strContents)
            Console.WriteLine("")

            bString = EncodeString(strContents, "UTF-8")
            strEncodedContents = System.Text.Encoding.UTF8.GetString(bString)
            Dim objWriter As New System.IO.StreamWriter(filepath.Replace(".csv", "_encoded.csv"))
            objWriter.WriteLine(strEncodedContents)
            objWriter.Close()
            Console.WriteLine("Encoding Finished")

        Catch Ex As Exception
            ErrInfo = Ex.Message
            Console.WriteLine(ErrInfo)
        End Try        
        Console.ReadKey()
    End Sub

    Public Function DetermineFileType(ByVal aFileName As String) As String
        Dim sEncoding As String = String.Empty

        Dim oSR As New StreamReader(aFileName, True)
        oSR.ReadToEnd()
        ' Add this line to read the file.
        sEncoding = oSR.CurrentEncoding.EncodingName

        Return sEncoding
    End Function

    Function EncodeString(ByRef SourceData As String, ByRef CharSet As String) As Byte()
        'get a byte pointer To the source data
        Dim bSourceData As Byte() = System.Text.Encoding.Unicode.GetBytes(SourceData)

        'get destination encoding 
        Dim OutEncoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(CharSet)

        'Encode the data To destination code page/charset
        Return System.Text.Encoding.Convert(OutEncoding, System.Text.Encoding.UTF8, bSourceData)
    End Function
End Module


Recommended answer

StreamReader has a constructor that takes an Encoding. If you know the encoding of the file, pass it to the StreamReader constructor:

objReader = New StreamReader(filepath, Encoding.UTF32)
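
When the encoding is not known up front, StreamReader can also be asked to look for a byte order mark (BOM), which is what the DetermineFileType function in the question relies on. The following is only a sketch: the module name, the DetectEncoding helper and the choice of Encoding.Default as the fallback are assumptions made for illustration, and a file without a BOM will simply report the fallback.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module DetectEncodingSketch
    ' Hypothetical helper: returns the encoding StreamReader settles on for the file.
    ' fallback is used when the file carries no byte order mark.
    Function DetectEncoding(ByVal filePath As String, ByVal fallback As Encoding) As Encoding
        Using reader As New StreamReader(filePath, fallback, True)
            ' Detection only happens on the first read, so read one character
            ' before asking for CurrentEncoding.
            reader.Read()
            Return reader.CurrentEncoding
        End Using
    End Function

    Sub Main(ByVal args As String())
        If args.Length = 0 Then
            Console.WriteLine("Usage: pass the path of the file to inspect")
            Return
        End If
        ' Encoding.Default as the fallback is an assumption; pick whatever
        ' encoding your BOM-less files are most likely to use.
        Dim detected As Encoding = DetectEncoding(args(0), Encoding.Default)
        Console.WriteLine(detected.EncodingName)
    End Sub
End Module
```

BOM detection cannot identify every encoding, so passing the known encoding explicitly remains the more reliable option.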

Edit

Going by your comment about the file's encoding, here is what Wikipedia says about UCS-2:

The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.

In that case, try decoding the file as UTF-16, which System.Text.Encoding calls Unicode:

objReader = New StreamReader(filepath, Encoding.Unicode)
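
Putting this together, a whole conversion could look roughly like the sketch below. It assumes the source really is UTF-16 little-endian; the module name, the ConvertUtf16ToUtf8 helper and the "_encoded.csv" suffix (borrowed from the question) are illustrative choices, not part of any library.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module ConvertToUtf8Sketch
    ' Hypothetical helper: reads sourcePath as UTF-16 LE and writes targetPath as UTF-8.
    Sub ConvertUtf16ToUtf8(ByVal sourcePath As String, ByVal targetPath As String)
        Dim contents As String

        ' Decode the bytes on disk as UTF-16 little-endian (Encoding.Unicode in .NET).
        Using reader As New StreamReader(sourcePath, Encoding.Unicode)
            contents = reader.ReadToEnd()
        End Using

        ' A .NET String holds text, not bytes, so no manual re-encoding is needed:
        ' the writer's encoding alone decides which bytes end up in the file.
        ' UTF8Encoding(False) writes UTF-8 without a byte order mark.
        Using writer As New StreamWriter(targetPath, False, New UTF8Encoding(False))
            writer.Write(contents)
        End Using
    End Sub

    Sub Main()
        Console.Write("Please give the file path (example: c:/testfile.csv): ")
        Dim sourcePath As String = Console.ReadLine()
        ' Strip the extension and append the suffix used in the question.
        Dim targetPath As String = Path.ChangeExtension(sourcePath, Nothing) & "_encoded.csv"
        ConvertUtf16ToUtf8(sourcePath, targetPath)
        Console.WriteLine("Wrote " & targetPath)
    End Sub
End Module
```

The EncodeString step from the question is dropped on purpose: converting the string to bytes and back to a string before writing is the most likely source of the junk, because the bytes were labelled with the wrong source encoding when passed to Encoding.Convert.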

FYI, Unicode is a standard that has a variety of encodings, including:

  • UTF-8
  • UTF-16 (BigEndian)
  • UTF-16 (LittleEndian)
  • UTF-32 (BigEndian)
  • UTF-32 (LittleEndian)
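
To see how these encodings differ in practice, the sketch below encodes the same single character with each of them and prints the resulting bytes. The module name and the sample character (U+00E9) are arbitrary choices for illustration. There is no static Encoding property for big-endian UTF-32 that I can rely on here, so the sketch constructs a UTF32Encoding instance directly.

```vbnet
Imports System
Imports System.Text

Module EncodingListSketch
    Sub Main()
        ' U+00E9 (e with acute accent), built with ChrW so the source file's
        ' own encoding does not matter.
        Dim sample As String = ChrW(&HE9)

        Dim encodings As Encoding() = {
            Encoding.UTF8,
            Encoding.BigEndianUnicode,
            Encoding.Unicode,
            New UTF32Encoding(True, False),
            Encoding.UTF32
        }

        ' Same character, different bytes: one line per encoding.
        For Each enc As Encoding In encodings
            Console.WriteLine(enc.WebName & ": " & BitConverter.ToString(enc.GetBytes(sample)))
        Next
    End Sub
End Module
```

Encoding.BigEndianUnicode and Encoding.Unicode are the two UTF-16 variants; Encoding.UTF32 is the little-endian UTF-32 variant.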

It is a little misleading, though not inaccurate, for Microsoft to call UTF-16 "Unicode": UTF-16 is just one of the possible encodings of Unicode.
