Convert XML from Latin-1 to UTF-8 and the other way around

Question

I am trying to convert an XML file from Latin-1 to UTF-8 and the other way around. I have run some tests, but without success. I'm using:

Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml

But this does not convert anything. So I tried specifying the encoding in Get-Content, but Latin1 is not recognized in PowerShell (or at least that is what the error message tells me). What is the best way to do this?

Recommended answer

The fastest method, especially with large XML files, is to use the .NET System.IO.File class.

[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) | 
    Set-Content r:\2.txt -Encoding UTF8
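
To see what the conversion changes at the byte level, here is a minimal sketch that writes a small Latin-1 file, transcodes it, and compares the bytes. The temp-file names are made up for illustration. It assumes you want UTF-8 without a byte order mark, so it writes with WriteAllText and a BOM-less UTF8Encoding instead of Set-Content (in Windows PowerShell 5.1, Set-Content -Encoding UTF8 prepends a BOM).

```powershell
# Hypothetical temp files, used only for this demonstration.
$src = Join-Path ([IO.Path]::GetTempPath()) 'latin1-demo.xml'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'utf8-demo.xml'

$latin1    = [Text.Encoding]::GetEncoding('iso-8859-1')
$utf8NoBom = [Text.UTF8Encoding]::new($false)   # $false = do not emit a BOM

# Build the text with [char]0xE9 (é) so the script file's own encoding does not matter.
$text = '<a>caf' + [char]0xE9 + '</a>'
[IO.File]::WriteAllText($src, $text, $latin1)

# Transcode: decode the bytes as Latin-1, then encode the string as UTF-8.
[IO.File]::WriteAllText($dst, [IO.File]::ReadAllText($src, $latin1), $utf8NoBom)

$srcBytes = [IO.File]::ReadAllBytes($src)
$dstBytes = [IO.File]::ReadAllBytes($dst)
'{0} bytes -> {1} bytes' -f $srcBytes.Length, $dstBytes.Length
```

The é is a single byte (0xE9) in the Latin-1 file and two bytes (0xC3 0xA9) in the UTF-8 file, which is why the output file is one byte longer.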

  • If your XML file has a declaration such as <?xml version="1.0" encoding="iso-8859-1" ?>, it needs to be changed too:

    [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).
        Replace('<?xml version="1.0" encoding="iso-8859-1"',
                '<?xml version="1.0" encoding="UTF-8"') | 
        Set-Content r:\2.txt -Encoding UTF8
    

  • To write Latin-1, use WriteAllText with an explicitly provided Latin-1 encoding:

    [IO.File]::WriteAllText(
        'r:\2.txt',
        [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).
            Replace('<?xml version="1.0" encoding="UTF-8"',
                    '<?xml version="1.0" encoding="iso-8859-1"'),
        [Text.Encoding]::GetEncoding('iso-8859-1')
    )
    

  • Memory-efficient transcoding that processes the file in fixed-size chunks, so it can handle files of any size (1TB? no problem!):

    function transcodeXML(
        [ValidateScript({Test-Path -Literal $_})]
        [string]$source,
        [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
        [string]$sourceEncoding,
    
        [ValidateScript({Test-Path -Literal $_ -IsValid})]
        [string]$target,
        [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
        [string]$targetEncoding
    ) {
        $reader = [IO.StreamReader]::new(
            $source,
            [Text.Encoding]::GetEncoding($sourceEncoding)
        )
        $writer = [IO.StreamWriter]::new(
            $target,
            $false, # don't append = overwrite
            [Text.Encoding]::GetEncoding($targetEncoding)
        )
        $buf = [char[]]::new(16MB)
    
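        try {
            # The XML declaration sits in the first chunk, so patch its encoding there.
            $nRead = $reader.Read($buf, 0, $buf.Length)
            $writer.Write(
                ([regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")").Replace(
                    [string]::new($buf, 0, $nRead),
                    '$1' + $targetEncoding,
                    1 # speedup: one replacement only
                )
            )
            # Copy the rest chunk by chunk; the writer re-encodes as it goes.
            while (!$reader.EndOfStream) {
                $nRead = $reader.Read($buf, 0, $buf.Length)
                $writer.Write($buf, 0, $nRead)
            }
        } finally {
            # Release both file handles even if reading or writing fails.
            $reader.Close()
            $writer.Close()
        }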
    }
    

    Usage:

    transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8
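
One caveat for the UTF-8 to Latin-1 direction: characters that ISO-8859-1 cannot represent are, with .NET's default encoder fallback, silently replaced (typically with `?`). A minimal sketch of a stricter variant, assuming you would rather fail than lose data, uses the GetEncoding overload that accepts explicit fallbacks. The variable names are illustrative only.

```powershell
# Default Latin-1 encoder: unrepresentable characters are substituted.
$lossy = [Text.Encoding]::GetEncoding('iso-8859-1')

# Strict Latin-1 encoder: unrepresentable characters raise an exception instead.
$strict = [Text.Encoding]::GetEncoding(
    'iso-8859-1',
    [Text.EncoderFallback]::ExceptionFallback,
    [Text.DecoderFallback]::ExceptionFallback
)

$cjk = [string][char]0x4E2D   # U+4E2D, a CJK ideograph that is not in ISO-8859-1

# The lossy encoder still returns a byte, but it no longer decodes to the original.
$lossyBytes = $lossy.GetBytes($cjk)
$roundTrip  = $lossy.GetString($lossyBytes)

# The strict encoder refuses to encode the character.
$strictFailed = $false
try {
    $null = $strict.GetBytes($cjk)
} catch {
    $strictFailed = $true     # an EncoderFallbackException, wrapped by PowerShell
}
```

Passing `$strict` as the last argument of WriteAllText (or to the StreamWriter constructor) makes the conversion stop at the first character that would otherwise be lost.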
    
