Unable to change encoding of text files in Windows


Question

I have some text files with different encodings. Some of them are UTF-8 and others are windows-1251 encoded. I tried to run the following recursive script to convert them all to UTF-8.

Get-ChildItem *.nfo -Recurse | ForEach-Object {
  $content = $_ | Get-Content
  Set-Content -PassThru $_.Fullname $content -Encoding UTF8 -Force
}

After that, I can no longer use the files in my Java program: the files that were already UTF-8 now have the wrong encoding too, and I couldn't get the original text back. For the windows-1251 encoded files I get empty output, just as with the original files. So the script corrupts files that were already UTF-8 encoded.

I found another solution, iconv, but as far as I can see it needs the current encoding as a parameter.

$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile 

Files with different encodings are mixed throughout the folder structure, so each file should stay at its current path.

The system uses code page 852. The existing UTF-8 files have no BOM.

Recommended answer

In Windows PowerShell you won't be able to use the built-in cmdlets, for two reasons:

  • From your OEM code page being 852 I infer that your "ANSI" code page is Windows-1250 (both are defined by the legacy system locale), which doesn't match your Windows-1251-encoded input files.

  • Using Set-Content (and similar cmdlets) with -Encoding UTF8 invariably creates files with a BOM (byte-order mark, https://en.wikipedia.org/wiki/Byte_order_mark), which Java and, more generally, Unix-heritage utilities don't understand.
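
To check whether a given file was written with a BOM, one option is to look at its first three bytes, which are EF BB BF for UTF-8. The following is a minimal sketch and not part of the original answer: Test-Utf8Bom is a hypothetical helper name, and the [type]::new() syntax assumes PowerShell v5 or later.

```powershell
# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM.
function Test-Utf8Bom {
  param([Parameter(Mandatory)] [string] $LiteralPath)

  # .NET methods don't know PowerShell's current location, so resolve
  # the path to a full file-system path first.
  $fullPath = Convert-Path -LiteralPath $LiteralPath

  $stream = [IO.File]::OpenRead($fullPath)
  try {
    # Read at most the first three bytes of the file.
    $buffer = [byte[]]::new(3)
    $count = $stream.Read($buffer, 0, 3)
  } finally {
    $stream.Dispose()
  }

  # The UTF-8 BOM is the byte sequence EF BB BF.
  $count -eq 3 -and
    $buffer[0] -eq 0xEF -and
    $buffer[1] -eq 0xBB -and
    $buffer[2] -eq 0xBF
}
```

Running something like Get-ChildItem *.nfo -Recurse | Where-Object { Test-Utf8Bom -LiteralPath $_.FullName } after the script from the question should then list the files that received a BOM.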

Note: PowerShell Core defaults to BOM-less UTF-8 and also lets you pass any available [System.Text.Encoding] instance to the -Encoding parameter, so there you could solve your problem with the built-in cmdlets, needing the .NET framework directly only to construct the encoding instance.
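
As an illustration of that note, converting a file that is already known to be Windows-1251 might look like the sketch below in PowerShell Core. It is not part of the original answer: Convert-Cp1251ToUtf8 is a hypothetical helper name, and the code assumes PowerShell 6 or later, since Windows PowerShell 5.1 accepts neither an encoding instance nor the utf8NoBOM name.

```powershell
# Hypothetical helper; assumes PowerShell (Core) 6+ and a file that is
# known in advance to be Windows-1251-encoded (no detection happens here).
function Convert-Cp1251ToUtf8 {
  param([Parameter(Mandatory)] [string] $LiteralPath)

  # Construct the encoding instance via .NET ...
  $cp1251 = [Text.Encoding]::GetEncoding(1251)

  # ... and pass it to the built-in cmdlet. -Raw reads the whole file as
  # a single string, so line endings are preserved as-is.
  $text = Get-Content -LiteralPath $LiteralPath -Raw -Encoding $cp1251

  # Write back in place as UTF-8 without a BOM; -NoNewline prevents an
  # extra trailing newline from being appended.
  Set-Content -LiteralPath $LiteralPath -Value $text -NoNewline -Encoding utf8NoBOM
}
```

This covers only the conversion step. Telling UTF-8 files apart from Windows-1251 files still needs a check such as the strict decoding used in the answer's script.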

In Windows PowerShell you must therefore use the .NET framework directly:

Get-ChildItem *.nfo -Recurse | ForEach-Object {

  $file = $_.FullName

  $mustReWrite = $false
  # Try to read as UTF-8 first; the strict decoder (second constructor
  # argument $true) throws an exception if invalid-as-UTF-8 bytes are
  # encountered. The [type]::new() syntax requires PowerShell v5 or later.
  try {
    # Only validity matters here, so discard the decoded text instead of
    # sending it to the output stream.
    $null = [IO.File]::ReadAllText($file, [Text.Utf8Encoding]::new($false, $true))
  } catch [System.Text.DecoderFallbackException] {
    # Not valid UTF-8: fall back to Windows-1251
    $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1251))
    $mustReWrite = $true
  }

  # Rewrite as UTF-8 without BOM (the .NET Framework's default)
  if ($mustReWrite) {
    Write-Verbose "Converting from 1251 to UTF-8: $file"
    [IO.File]::WriteAllText($file, $content)
  } else {
    Write-Verbose "Already UTF-8-encoded: $file"
  }

}

Note: As with your own attempt, the above solution reads each file into memory as a whole, but that could be changed.

Note:

  • If an input file contains only bytes in the (7-bit) ASCII range, it is by definition also UTF-8-encoded, because UTF-8 is a superset of ASCII.

  • It is highly unlikely with real-world input, but purely technically a Windows-1251-encoded file could be a valid UTF-8 file as well, if its byte sequences happen to be valid UTF-8 (which has strict rules about which bit patterns are allowed where). Such a file would not contain meaningful Windows-1251 content, however.

  • There is no reason to implement a further fallback strategy for decoding with Windows-1251, because there are no technical restrictions on which bit patterns can occur where. Generally, in the absence of external information (or a BOM), there is no simple and robust way to infer a file's encoding just from its content (though heuristics can be employed).
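
To make the difference between the two encodings concrete, the strict UTF-8 check can be applied to raw byte arrays. This is a minimal sketch, not part of the original answer: Test-ValidUtf8 is a hypothetical helper name, the sample bytes are illustrative, and PowerShell v5 or later is assumed.

```powershell
# Hypothetical helper: $true if the bytes form a valid UTF-8 sequence.
function Test-ValidUtf8 {
  param([Parameter(Mandatory)] [byte[]] $Bytes)

  # No BOM on output, and throw on invalid bytes instead of silently
  # substituting U+FFFD replacement characters.
  $strictUtf8 = [Text.UTF8Encoding]::new($false, $true)
  try {
    $null = $strictUtf8.GetString($Bytes)
    return $true
  } catch [System.Text.DecoderFallbackException] {
    return $false
  }
}

# The Russian word for "hello", encoded as Windows-1251 ...
$cp1251Bytes = [byte[]] (0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2)
# ... and the same word encoded as UTF-8.
$utf8Bytes = [byte[]] (0xD0, 0x9F, 0xD1, 0x80, 0xD0, 0xB8, 0xD0, 0xB2, 0xD0, 0xB5, 0xD1, 0x82)

Test-ValidUtf8 -Bytes $cp1251Bytes   # 0xCF must be followed by a continuation byte, but 0xF0 is not one
Test-ValidUtf8 -Bytes $utf8Bytes
```

The first call should return $false and the second $true. Every byte of the UTF-8 sequence is also a defined Windows-1251 character (D0 9F reads as "Рџ" there), which is the purely technical overlap described above: the bytes decode under both encodings, but the Windows-1251 reading is not meaningful text.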
