Unable to change encoding of text files in Windows
Question
I have some text files with different encodings. Some of them are UTF-8 and others are windows-1251 encoded. I tried to execute the following recursive script to encode them all to UTF-8.
Get-ChildItem *.nfo -Recurse | ForEach-Object {
    $content = $_ | Get-Content
    Set-Content -PassThru $_.Fullname $content -Encoding UTF8 -Force
}
After that I am unable to use the files in my Java program: the files that were already UTF-8-encoded now have the wrong encoding too, and I cannot get the original text back. For the windows-1251-encoded files I get empty output, just as with the original files. So the script corrupts the files that were already UTF-8-encoded.
I found another solution, iconv, but as far as I can see it needs the current encoding as a parameter.
$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile
Differently encoded files are mixed in a folder structure, so the files should stay at the same paths.
The system uses code page 852. The existing UTF-8 files are without a BOM.
Recommended answer
In Windows PowerShell you won't be able to use the built-in cmdlets for two reasons:
- From your OEM code page being 852 I infer that your "ANSI" code page is Windows-1250 (both are defined by the legacy system locale), which doesn't match your Windows-1251-encoded input files.
- Using Set-Content (and similar cmdlets) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark), which Java and, more generally, Unix-heritage utilities don't understand.
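To check whether a given file starts with the UTF-8 BOM (the bytes EF BB BF), a small helper along these lines may be useful. This is only a sketch: the function name Test-Utf8Bom and the temp-file names are made up for illustration, and it reads the whole file even though only the first three bytes matter, which is fine for small .nfo files.

```powershell
# Hypothetical helper (not a built-in cmdlet): reports whether a file
# starts with the UTF-8 BOM bytes EF BB BF.
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string] $LiteralPath)
    # Reads the entire file; acceptable for small files.
    $bytes = [IO.File]::ReadAllBytes($LiteralPath)
    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Demo with two throwaway files in the temp folder (illustrative names).
$withBom = Join-Path ([IO.Path]::GetTempPath()) 'demo-with-bom.txt'
$noBom   = Join-Path ([IO.Path]::GetTempPath()) 'demo-no-bom.txt'

# An explicit UTF8Encoding with encoderShouldEmitUTF8Identifier = $true writes a BOM.
[IO.File]::WriteAllText($withBom, 'abc', [Text.UTF8Encoding]::new($true))
# WriteAllText without an encoding argument writes UTF-8 without a BOM.
[IO.File]::WriteAllText($noBom, 'abc')

Test-Utf8Bom -LiteralPath $withBom
Test-Utf8Bom -LiteralPath $noBom
```

Running the helper on a file produced by Set-Content -Encoding UTF8 in Windows PowerShell is a quick way to confirm whether a BOM is what trips up the Java program.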
Note: PowerShell Core actually defaults to BOM-less UTF-8 and also allows you to pass any available [System.Text.Encoding] instance to the -Encoding parameter, so you could solve your problem with the built-in cmdlets there, while needing direct use of the .NET framework only to construct an encoding instance.
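As a rough sketch of what that could look like in PowerShell Core (6 or later; this will not work in Windows PowerShell), the following converts a single file that is already known to be Windows-1251-encoded. The sample file name and its contents are made up for the demo, and the sketch does not detect the encoding by itself.

```powershell
# Requires PowerShell (Core) 6 or later.
# Demo input: the Cyrillic word "Привет" stored as Windows-1251 bytes
# in a throwaway temp file (illustrative name).
$path = Join-Path ([IO.Path]::GetTempPath()) 'demo-1251.nfo'
[IO.File]::WriteAllBytes($path, [byte[]] (0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2))

# Construct the source encoding via .NET and hand it to the built-in cmdlet.
$cp1251 = [System.Text.Encoding]::GetEncoding(1251)
$text = Get-Content -Raw -LiteralPath $path -Encoding $cp1251

# utf8NoBOM is also the default in PowerShell Core; it is spelled out here for clarity.
# -NoNewline avoids appending a trailing newline that the original did not have.
Set-Content -NoNewline -LiteralPath $path -Value $text -Encoding utf8NoBOM

# Inspect the result: the file should now hold UTF-8 bytes without a BOM.
$bytes = [IO.File]::ReadAllBytes($path)
($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
```

Because the file content is fully read into $text before Set-Content runs, writing back to the same path is safe here.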
You must therefore use the .NET framework directly:
Get-ChildItem *.nfo -Recurse | ForEach-Object {
    $file = $_.FullName
    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered.
    # (Arguments: no BOM on output, throw on invalid bytes.)
    try {
        # $null = ... discards the text; only the validity check matters here.
        $null = [IO.File]::ReadAllText($file, [Text.Utf8Encoding]::new($false, $true))
    } catch [System.Text.DecoderFallbackException] {
        # Not valid UTF-8: fall back to Windows-1251
        $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
        $mustReWrite = $true
    }
    # Rewrite as UTF-8 without BOM (the .NET framework's default)
    if ($mustReWrite) {
        Write-Verbose "Converting from 1251 to UTF-8: $file"
        [IO.File]::WriteAllText($file, $content)
    } else {
        Write-Verbose "Already UTF-8-encoded: $file"
    }
}
Note: As with your own attempt, the above solution reads each file into memory as a whole, but that could be changed.
Note:
- If an input file comprises only bytes with ASCII-range characters (7-bit), it is by definition also UTF-8-encoded, because UTF-8 is a superset of ASCII encoding.
- It is highly unlikely with real-world input, but purely technically a Windows-1251-encoded file could be a valid UTF-8 file as well, if the bit patterns and byte sequences happen to be valid UTF-8 (which has strict rules around what bit patterns are allowed where). Such a file would not contain meaningful Windows-1251 content, however.
- There is no reason to implement a fallback strategy for decoding with Windows-1251, because there are no technical restrictions on what bit patterns can occur where. Generally, in the absence of external information (or a BOM), there's no simple and no robust way to infer a file's encoding just from its content (though heuristics can be employed).
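The points above can be tried out on raw byte arrays. The following is a minimal sketch; the function name Test-ValidUtf8 and the sample byte sequences are made up for illustration. It uses the same strict decoder as the solution above, which throws on bytes that are invalid as UTF-8.

```powershell
# Hypothetical helper: returns $true if the bytes decode as strict UTF-8.
function Test-ValidUtf8 {
    param([Parameter(Mandatory)][byte[]] $Bytes)
    # Second constructor argument $true = throw on invalid bytes.
    $strictUtf8 = [Text.UTF8Encoding]::new($false, $true)
    try {
        $null = $strictUtf8.GetString($Bytes)
        return $true
    } catch [System.Text.DecoderFallbackException] {
        return $false
    }
}

# ASCII-only bytes ("abc") are valid UTF-8 by definition.
$asciiOnly = [byte[]] (0x61, 0x62, 0x63)

# The Cyrillic word "Привет" as Windows-1251 bytes: not valid UTF-8,
# because 0xCF starts a two-byte sequence and 0xF0 is not a continuation byte.
$cyrillic1251 = [byte[]] (0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2)

# Two bytes that happen to form a valid UTF-8 sequence; read as Windows-1251
# they would be an unlikely character pair rather than meaningful text.
$ambiguous = [byte[]] (0xD0, 0x9F)

Test-ValidUtf8 -Bytes $asciiOnly
Test-ValidUtf8 -Bytes $cyrillic1251
Test-ValidUtf8 -Bytes $ambiguous
```

The third case is the one where the try-UTF-8-first approach would misclassify a file, which is why it only matters for input that is unlikely to occur in practice.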