Unable to change encoding of text files in Windows

Published: 2021/4/21 20:24:40. Tags: windows powershell character-encoding command-prompt iconv


Question

I have some text files with different encodings: some are UTF-8 and others are windows-1251. I tried to run the following recursive script to convert them all to UTF-8.

Get-ChildItem *.nfo -Recurse | ForEach-Object {
  $content = $_ | Get-Content
  Set-Content -PassThru $_.FullName $content -Encoding UTF8 -Force
}

After that I can no longer use the files in my Java program: the files that were already UTF-8 now have the wrong encoding too, and I can't get the original text back. For the windows-1251-encoded files I get empty output, just as I did with the original files. So the script corrupts files that were already UTF-8-encoded.

I found another solution, iconv, but as far as I can see it needs the current encoding as a parameter.

$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile

Differently encoded files are mixed together in one folder structure, so each file should stay at its current path.

The system uses code page 852. The existing UTF-8 files have no BOM.

Recommended answer

In Windows PowerShell you won't be able to use the built-in cmdlets, for two reasons:

Your OEM code page is 852, from which I infer that your "ANSI" code page is Windows-1250 (both are defined by the legacy system locale). That doesn't match your Windows-1251-encoded input files. Because Get-Content in Windows PowerShell decodes BOM-less files using the ANSI code page, both your Windows-1251 files and your BOM-less UTF-8 files are misread.
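The effect of a mismatched code page can be reproduced outside PowerShell. The following is a minimal Python sketch (Python is used here only for illustration; the sample string and variable names are assumptions, not taken from the question) that decodes Windows-1251 and BOM-less UTF-8 bytes as Windows-1250 and then re-encodes the result as UTF-8.

```python
# Sample text; any Cyrillic string would do (hypothetical example data).
original = "Привет"

# The same text as it would sit on disk in the two encodings from the question.
as_1251 = original.encode("cp1251")
as_utf8 = original.encode("utf-8")

# Decoding either byte sequence as Windows-1250 yields wrong characters
# without raising an error for this sample.
garbled_from_1251 = as_1251.decode("cp1250", errors="replace")
garbled_from_utf8 = as_utf8.decode("cp1250", errors="replace")

print(repr(garbled_from_1251))
print(repr(garbled_from_utf8))

# Re-encoding the misread characters as UTF-8 does not give back the input bytes.
rewritten = garbled_from_utf8.encode("utf-8")
print(rewritten == as_utf8)  # False
```

This mirrors what the script in the question does to a file that was already UTF-8: the bytes are misread once and the misreading is then written back to disk.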

Using Set-Content (and similar cmdlets) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark, https://en.wikipedia.org/wiki/Byte_order_mark), which Java and, more generally, Unix-heritage utilities don't understand.
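To make the BOM issue concrete, here is a small Python sketch (for illustration only; `strip_bom` is a hypothetical helper name, not part of any tool mentioned above). It shows the three bytes that a UTF-8 BOM adds and what a decoder that does not treat the BOM specially returns.

```python
import codecs

text = "abc"

with_bom = text.encode("utf-8-sig")   # UTF-8 with a leading BOM
without_bom = text.encode("utf-8")    # plain UTF-8

print(with_bom.startswith(codecs.BOM_UTF8))     # True
print(without_bom.startswith(codecs.BOM_UTF8))  # False

# A decoder that does not treat the BOM specially keeps it as U+FEFF.
decoded = with_bom.decode("utf-8")
print(decoded == "\ufeff" + text)  # True


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

A consumer that compares the first line of such a file against an expected string would see the extra U+FEFF character and fail to match.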

Note: PowerShell Core actually defaults to BOM-less UTF-8 and also lets you pass any available [System.Text.Encoding] instance to the -Encoding parameter. There you could solve your problem with the built-in cmdlets and would need the .NET framework directly only to construct an encoding instance.

You must therefore use the .NET framework directly:

Get-ChildItem *.nfo -Recurse | ForEach-Object {

  $file = $_.FullName

  $mustReWrite = $false
  # Try to read as UTF-8 first and throw an exception if
  # invalid-as-UTF-8 bytes are encountered.
  # The result is discarded ($null = ...) so that the file content
  # is not written to the pipeline; only the validity check matters here.
  try {
    $null = [IO.File]::ReadAllText($file, [Text.Utf8Encoding]::new($false, $true))
  } catch [System.Text.DecoderFallbackException] {
    # Fall back to Windows-1251
    $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
    $mustReWrite = $true
  }

  # Rewrite as UTF-8 without BOM (the .NET framework's default)
  if ($mustReWrite) {
    Write-Verbose "Converting from 1251 to UTF-8: $file"
    [IO.File]::WriteAllText($file, $content)
  } else {
    Write-Verbose "Already UTF-8-encoded: $file"
  }

}
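For readers who want to check the same strategy outside PowerShell, the following Python sketch follows the same steps: try a strict UTF-8 decode, fall back to Windows-1251, and rewrite as BOM-less UTF-8 in place. The function names `convert_to_utf8` and `convert_tree` are hypothetical, and the fallback encoding is an assumption taken from the question, not something the code detects.

```python
from pathlib import Path


def convert_to_utf8(path: Path) -> bool:
    """Rewrite path as BOM-less UTF-8 unless it already decodes as UTF-8.

    Returns True if the file was rewritten.
    """
    data = path.read_bytes()
    try:
        data.decode("utf-8")  # strict by default: raises on invalid bytes
        return False          # already valid UTF-8, leave the file untouched
    except UnicodeDecodeError:
        # Assumed fallback from the question. Python's cp1251 codec can still
        # raise here if the data contains a byte value it leaves unassigned.
        text = data.decode("cp1251")
    path.write_bytes(text.encode("utf-8"))  # no BOM is written
    return True


def convert_tree(root: Path, pattern: str = "*.nfo") -> list:
    """Convert matching files under root in place; return the rewritten paths."""
    return [p for p in sorted(root.rglob(pattern)) if convert_to_utf8(p)]
```

Calling `convert_tree(Path("."))` would process every `.nfo` file below the current directory and leave each one at its original path.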

Note: As with your own attempt, the solution above reads each file into memory as a whole, but that could be changed.
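One possible way to avoid loading whole files, at least for the validity check, is to decode in chunks. This is only a sketch in Python under that assumption; `is_valid_utf8` and its `chunk_size` parameter are hypothetical names, and the answer itself does not prescribe this approach.

```python
import codecs


def is_valid_utf8(path, chunk_size=64 * 1024):
    """Check a file for UTF-8 validity without loading it all at once."""
    # The incremental decoder buffers a multi-byte sequence that is split
    # across two chunks, so chunk boundaries do not cause false errors.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                decoder.decode(chunk)
            # final=True reports a sequence that is cut off at end of file.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

A file that fails this check could then be converted with a streaming reader and writer rather than a single read and write.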

Note:

If an input file comprises only bytes in the 7-bit ASCII range, it is by definition also UTF-8-encoded, because UTF-8 is a superset of ASCII.
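A quick way to convince yourself of this, sketched in Python with a made-up sample string: pure ASCII bytes decode to the same text under ASCII, UTF-8, and Windows-1251, so such a file needs no conversion.

```python
ascii_bytes = "plain ASCII text".encode("ascii")

# Every byte is below 0x80 ...
print(all(b < 0x80 for b in ascii_bytes))  # True

# ... so the UTF-8 and Windows-1251 decoders agree on the result.
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("cp1251")
print(same_text)  # True
```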

It is highly unlikely with real-world input, but purely technically a Windows-1251-encoded file could also be a valid UTF-8 file, if its bit patterns and byte sequences happen to be valid UTF-8 (which has strict rules about which bit patterns are allowed where). Such a file would not contain meaningful Windows-1251 content, however.
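The ambiguity can be demonstrated with a short Python sketch (the sample word is an assumption chosen for illustration): the same bytes decode without error under both encodings, but only one reading is meaningful.

```python
# Bytes that are valid UTF-8 by construction.
utf8_bytes = "Привет".encode("utf-8")

# The same bytes also decode under Windows-1251, as different characters.
as_1251_text = utf8_bytes.decode("cp1251")
print(repr(as_1251_text))

# A Windows-1251 file containing exactly that text is byte-for-byte valid UTF-8.
round_trip = as_1251_text.encode("cp1251")
print(round_trip == utf8_bytes)  # True
```

The Windows-1251 reading is a string of unrelated letters and symbols, which is why the UTF-8-first strategy is safe for realistic input.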

There is no reason to implement a fallback strategy for the Windows-1251 decoding step, because Windows-1251 places no technical restrictions on which bit patterns can occur where. Generally, in the absence of external information (or a BOM), there is no simple, robust way to infer a file's encoding from its content alone (though heuristics can be employed).
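The contrast between a strict multi-byte encoding and a single-byte code page can be sketched in Python as follows. The helper `decodes` is a hypothetical name, and Python's codec tables may differ slightly from .NET's (Python's cp1251 codec leaves a small number of byte values unassigned), so treat the counts as illustrative.

```python
def decodes(byte_value: int, encoding: str) -> bool:
    """Return True if a single byte decodes without error in the encoding."""
    try:
        bytes([byte_value]).decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


high_bytes = range(0x80, 0x100)
ok_1251 = sum(decodes(b, "cp1251") for b in high_bytes)
ok_utf8 = sum(decodes(b, "utf-8") for b in high_bytes)

print(f"cp1251 accepts {ok_1251} of {len(high_bytes)} lone high bytes")
print(f"utf-8 accepts {ok_utf8} of {len(high_bytes)} lone high bytes")
```

Because a single-byte code page accepts almost any byte, a failed decode is not a usable signal there, whereas a failed strict UTF-8 decode is.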
