Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)


Problem Description


I've been forcing the usage of chcp 65001 in Command Prompt and Windows Powershell for some time now, but judging by Q&A posts on SO and several other communities it seems like a dangerous and inefficient solution. Does Microsoft provide an improved / complete alternative to chcp 65001 that can be saved permanently without manual alteration of the Registry? And if there isn't, is there a publicly announced timeline or agenda to support UTF-8 in the Windows CLI in the future?

Personally I've been using chcp 949 for Korean Character Support, but the weird display of the backslash and incorrect/incomprehensible displays in several applications (like Neovim), as well as characters that aren't Korean not being supported via 949 seems to become more of a problem lately.

Solution

Note:

  • This answer shows how to switch the character encoding in the Windows console to
    UTF-8 (code page 65001), so that shells such as cmd.exe and PowerShell properly encode and decode characters (text) when communicating with external (console) programs with full Unicode support, and in cmd.exe also for file I/O.[1]

  • If, by contrast, your concern is about the separate aspect of the limitations of Unicode character rendering in console windows, see the middle and bottom sections of this answer, where alternative console (terminal) applications are discussed too.


Does Microsoft provide an improved / complete alternative to chcp 65001 that can be saved permanently without manual alteration of the Registry?

As of (at least) Windows 10, version 1903, you have the option to set the system locale (language for non-Unicode programs) to UTF-8, but the feature is still in beta as of this writing.

To activate it:

  • Run intl.cpl (which opens the regional settings in Control Panel)
  • On the Administrative tab, click Change system locale..., check Beta: Use Unicode UTF-8 for worldwide language support, and reboot when prompted.

  • This sets both the system's active OEM and ANSI code pages to 65001, the UTF-8 code page, which therefore (a) makes all future console windows, which use the OEM code page, default to UTF-8 (as if chcp 65001 had been executed in a cmd.exe window) and (b) also makes legacy, non-Unicode GUI-subsystem applications, which (among others) use the ANSI code page, use UTF-8 (see the verification sketch after this list).

    • Caveats:

      • If you're using Windows PowerShell, this will also make Get-Content and Set-Content and other contexts where Windows PowerShell defaults to the system's active ANSI code page, notably reading source code from BOM-less files, default to UTF-8 (which PowerShell Core (v6+) always does). This means that, in the absence of an -Encoding argument, BOM-less files that are ANSI-encoded (which is historically common) will then be misread, and files created with Set-Content will be UTF-8-encoded rather than ANSI-encoded.

      • [Fixed in PowerShell 7.1] Up to at least PowerShell 7.0, a bug in the underlying .NET version (.NET Core 3.1) causes follow-on bugs in PowerShell: a UTF-8 BOM is unexpectedly prepended to data sent to external processes via stdin (irrespective of what you set $OutputEncoding to), which notably breaks Start-Job - see this GitHub issue.

      • Not all fonts speak Unicode, so pick a TT (TrueType) font, but even they usually support only a subset of all characters, so you may have to experiment with specific fonts to see if all characters you care about are represented - see this answer for details, which also discusses alternative console (terminal) applications that have better Unicode rendering support.

      • As eryksun points out, legacy console applications that do not "speak" UTF-8 will be limited to ASCII-only input and will produce incorrect output when trying to output characters outside the (7-bit) ASCII range. (In the obsolescent Windows 7 and below, programs may even crash).
        If running legacy console applications is important to you, see eryksun's recommendations in the comments.

  • However, for Windows PowerShell, that is not enough:

    • You must additionally set the $OutputEncoding preference variable to UTF-8 as well: $OutputEncoding = [System.Text.UTF8Encoding]::new()[2]; it's simplest to add that command to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file.
    • Fortunately, this is no longer necessary in PowerShell Core, which internally consistently defaults to BOM-less UTF-8.
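
To confirm that the system-wide switch and the $OutputEncoding setting described above have taken effect, you can read the active code pages back from the registry and inspect the current session's encodings. The following is a minimal verification sketch; it assumes a standard Windows 10 installation, where the NLS code-page values live under HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage:

# Verification sketch: system-wide code pages and current-session encodings.
# ACP and OEMCP are the standard NLS registry values; both read 65001 once the
# Beta UTF-8 system-locale option is active.
$nls = Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage'
"ANSI code page (ACP):      $($nls.ACP)"
"OEM code page (OEMCP):     $($nls.OEMCP)"
"[console]::OutputEncoding: $([console]::OutputEncoding.WebName)"
"`$OutputEncoding:           $($OutputEncoding.WebName)"   # 'utf-8' once set as described above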

If setting the system locale to UTF-8 is not an option in your environment, use startup commands instead:

Note: The caveat re legacy console applications mentioned above equally applies here. If running legacy console applications is important to you, see eryksun's recommendations in the comments.

  • For PowerShell (both editions), add the following line to your $PROFILE (current user only) or $PROFILE.AllUsersCurrentHost (all users) file, which is the equivalent of chcp 65001, supplemented with setting preference variable $OutputEncoding to instruct PowerShell to send data to external programs via the pipeline in UTF-8:

    • Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp; additionally, as stated, Windows PowerShell requires $OutputEncoding to be set - see this answer for details.

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding

  • For example, here's a quick-and-dirty approach to add this line to $PROFILE programmatically:

'$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding' + [Environment]::Newline + (Get-Content -Raw $PROFILE -ErrorAction SilentlyContinue) | Set-Content -Encoding utf8 $PROFILE

  • For cmd.exe, define an auto-run command via the registry, in value AutoRun of key HKEY_CURRENT_USER\Software\Microsoft\Command Processor (current user only) or HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor (all users):

    • For instance, you can use PowerShell to create this value for you:

# Auto-execute `chcp 65001` whenever the current user opens a `cmd.exe` console
# window (including when running a batch file):
Set-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' AutoRun 'chcp 65001 >NUL'
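
As a quick sanity check (a sketch, assuming the AutoRun value was created as shown above), you can read the value back and launch cmd.exe; its AutoRun command runs first, so chcp should already report the UTF-8 code page:

# Read the AutoRun value back:
Get-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' AutoRun

# Launch cmd.exe; the AutoRun command fires first, so this should print: Active code page: 65001
cmd /c chcp

# To undo the change later, remove the value again:
# Remove-ItemProperty 'HKCU:\Software\Microsoft\Command Processor' -Name AutoRun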


Optional reading: Why the Windows PowerShell ISE is a poor choice:

While the ISE does have better Unicode rendering support than the console, it is generally a poor choice:

  • First and foremost, the ISE is obsolescent: it doesn't support PowerShell Core, where all future development will go, and it isn't cross-platform, unlike the new premier IDE for both PowerShell editions, Visual Studio Code, which already speaks UTF-8 by default for PowerShell Core and can be configured to do so for Windows PowerShell.

  • The ISE is generally an environment for developing scripts, not for running them in production (if you're writing scripts (also) for others, you should assume that they'll be run in the console); notably, the ISE's behavior is not the same in all aspects when it comes to running scripts.

  • As eryksun points out, the ISE doesn't support running interactive external console programs, namely those that require user input:

The problem is that it hides the console and redirects the process output (but not input) to a pipe. Most console applications switch to full buffering when a file is a pipe. Also, interactive applications require reading from stdin, which isn't possible from a hidden console window. (It can be unhidden via ShowWindow, but a separate window for input is clunky.)

  • If you're willing to live with that limitation, switching the active code page to 65001 (UTF-8) for proper communication with external programs requires an awkward workaround:

    • You must first force creation of the hidden console window by running any external program from the built-in console, e.g., chcp - you'll see a console window flash briefly.

    • Only then can you set [console]::OutputEncoding (and $OutputEncoding) to UTF-8, as shown above (if the hidden console hasn't been created yet, you'll get a "handle is invalid" error).
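
Put together, the workaround looks roughly like this (a sketch meant to be run inside the ISE only; chcp merely serves as a convenient external program that forces creation of the hidden console):

# ISE-only workaround sketch: force the hidden console into existence first ...
chcp | Out-Null
# ... then switching the encodings succeeds (otherwise this line fails with a
# "handle is invalid" error):
$OutputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding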


[1] In PowerShell, if you never call external programs, you needn't worry about the system locale (active code pages): PowerShell-native commands and .NET calls always communicate via UTF-16 strings (native .NET strings) and on file I/O apply default encodings that are independent of the system locale. Similarly, because the Unicode versions of the Windows API functions are used to print to and read from the console, non-ASCII characters always print correctly (within the rendering limitations of the console).
In cmd.exe, by contrast, the system locale matters for file I/O (with < and > redirections, but notably including what encoding to assume for batch-file source code), not just for communicating with external programs in-memory (such as when reading program output in a for /f loop).

[2] In PowerShell v4-, where the static ::new() method isn't available, use $OutputEncoding = (New-Object System.Text.UTF8Encoding).psobject.BaseObject. See GitHub issue #5763 for why the .psobject.BaseObject part is needed.
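
If a single profile has to serve multiple PowerShell versions, the recommendation above and the v4- fallback from [2] can be combined with a version guard; a rough sketch:

# Version-guarded sketch for a shared profile: use ::new() where available (v5+),
# otherwise fall back to the New-Object form discussed in [2].
if ($PSVersionTable.PSVersion.Major -ge 5) {
    $enc = [System.Text.UTF8Encoding]::new()
} else {
    $enc = (New-Object System.Text.UTF8Encoding).psobject.BaseObject
}
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = $enc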
