Convert XML Latin1 to UTF-8 and the other way around
Problem description
I am trying to convert an XML file from Latin1 to UTF-8 and the other way around. I have run some tests, but without success. I'm using
Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml
But this does not convert anything. So I tried specifying the encoding in Get-Content, but Latin1 is not recognized in PowerShell (or at least that is what the error message tells me).
What is the best way to do this?
Recommended answer
The fastest method, especially with large XML files, is to use the .NET System.IO.File class.
Use ReadAllText with an explicitly provided Latin-1 encoding:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) |
Set-Content r:\2.txt -Encoding UTF8
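To see what the conversion actually changes on disk, here is a minimal sketch that encodes the same string both ways and prints the bytes. The sample string and variable names are illustrative only; the accented character is built with [char]0xE9 so the result does not depend on how the script file itself is saved.

```powershell
# Encode the same text as Latin-1 and as UTF-8 and compare the bytes.
$latin1 = [Text.Encoding]::GetEncoding('iso-8859-1')
$utf8   = [Text.Encoding]::UTF8

# "cafe" with an acute accent on the e (U+00E9)
$text = 'caf' + [char]0xE9

$latin1Bytes = $latin1.GetBytes($text)
$utf8Bytes   = $utf8.GetBytes($text)

# Render each byte array as space-separated hex
$latin1Hex = ($latin1Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$utf8Hex   = ($utf8Bytes   | ForEach-Object { $_.ToString('X2') }) -join ' '

"Latin-1: $latin1Hex"   # Latin-1: 63 61 66 E9
"UTF-8:   $utf8Hex"     # UTF-8:   63 61 66 C3 A9
```

Characters above U+007F take one byte in Latin-1 but two bytes in UTF-8, which is why reading a file with the wrong encoding garbles exactly those characters.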
If your XML file declares <?xml version="1.0" encoding="iso-8859-1" ?>, the declaration needs to be changed too:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).
Replace('<?xml version="1.0" encoding="iso-8859-1"',
'<?xml version="1.0" encoding="UTF-8"') |
Set-Content r:\2.txt -Encoding UTF8
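One thing to be aware of: in Windows PowerShell 5.1, Set-Content -Encoding UTF8 writes a byte order mark, and piping a single string to Set-Content appends a trailing newline. If neither is wanted, a sketch along the following lines may help. It writes through WriteAllText with a BOM-less UTF8Encoding instead. The temporary file names and the sample document are made up for the demonstration and are not part of the original answer.

```powershell
# Hypothetical sample files in the temp folder (names are illustrative only)
$src = Join-Path ([IO.Path]::GetTempPath()) 'latin1-sample.xml'
$dst = Join-Path ([IO.Path]::GetTempPath()) 'utf8-sample.xml'

$latin1    = [Text.Encoding]::GetEncoding('iso-8859-1')
$utf8NoBom = [Text.UTF8Encoding]::new($false)   # $false = do not emit a BOM

# Create a small Latin-1 document containing U+00E9
$sample = '<?xml version="1.0" encoding="iso-8859-1"?><a>' + [char]0xE9 + '</a>'
[IO.File]::WriteAllText($src, $sample, $latin1)

# Read as Latin-1, fix the declaration, write as UTF-8 without BOM
$xml = [IO.File]::ReadAllText($src, $latin1).Replace(
    '<?xml version="1.0" encoding="iso-8859-1"',
    '<?xml version="1.0" encoding="UTF-8"')
[IO.File]::WriteAllText($dst, $xml, $utf8NoBom)

# The output starts directly with '<' (0x3C) rather than the BOM bytes EF BB BF
$bytes = [IO.File]::ReadAllBytes($dst)
'First byte: {0:X2}' -f $bytes[0]
```

Whether a BOM is desirable depends on the consumer of the file; XML parsers generally accept UTF-8 with or without one.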
To write Latin-1, use WriteAllText with an explicitly provided Latin-1 encoding:
[IO.File]::WriteAllText(
'r:\2.txt',
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).
Replace('<?xml version="1.0" encoding="UTF-8"',
'<?xml version="1.0" encoding="iso-8859-1"'),
[Text.Encoding]::GetEncoding('iso-8859-1')
)
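Going from UTF-8 to Latin-1 can lose data, because Latin-1 only covers code points up to U+00FF. A minimal sketch of a pre-check, assuming you would rather detect this than have characters silently substituted: request the encoding with exception fallbacks and try to encode the text. The function name Test-Latin1Representable is hypothetical and not part of the original answer.

```powershell
# Returns $true if every character of $text can be encoded as Latin-1.
function Test-Latin1Representable([string]$text) {
    # Same encoding, but configured to throw instead of substituting characters
    $strict = [Text.Encoding]::GetEncoding(
        'iso-8859-1',
        [Text.EncoderFallback]::ExceptionFallback,
        [Text.DecoderFallback]::ExceptionFallback
    )
    try {
        [void]$strict.GetBytes($text)
        return $true
    }
    catch {
        # An EncoderFallbackException lands here for unrepresentable characters
        return $false
    }
}

$ok  = Test-Latin1Representable ('caf' + [char]0xE9)       # U+00E9 is in Latin-1
$bad = Test-Latin1Representable ('10 ' + [char]0x20AC)     # the euro sign is not
"Representable: $ok / $bad"
```

For large files this check reads the whole text into memory, so it fits the ReadAllText approach above rather than a streaming one.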
Memory-efficient transcoding that can process files of any size (1TB? no problem!):
function transcodeXML(
[ValidateScript({Test-Path -Literal $_})]
[string]$source,
[ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
[string]$sourceEncoding,
[ValidateScript({Test-Path -Literal $_ -IsValid})]
[string]$target,
[ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
[string]$targetEncoding
) {
    $reader = [IO.StreamReader]::new(
        $source,
        [Text.Encoding]::GetEncoding($sourceEncoding)
    )
    $writer = [IO.StreamWriter]::new(
        $target,
        $false, # don't append = overwrite
        [Text.Encoding]::GetEncoding($targetEncoding)
    )
    try {
        $buf = [char[]]::new(16MB)
        # Read the first chunk and rewrite the encoding name in the XML declaration
        $nRead = $reader.Read($buf, 0, $buf.Length)
        $writer.Write(
            ([regex]"(<\?xml [^>]*?encoding="")(?i)$([regex]::Escape($sourceEncoding))(?="")").Replace(
                [string]::new($buf, 0, $nRead),
                '$1' + $targetEncoding,
                1 # speedup: one replacement only
            )
        )
        # Copy the rest of the file chunk by chunk
        while (!$reader.EndOfStream) {
            $nRead = $reader.Read($buf, 0, $buf.Length)
            $writer.Write($buf, 0, $nRead)
        }
    }
    finally {
        # Release both file handles even if reading or writing fails
        $reader.Close()
        $writer.Close()
    }
}
Usage:
transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8