How to ensure Python prints UTF-8 (and not UTF-16-LE) when piped in PowerShell?


Problem description


I want to print text as UTF-8 when piped (to, for example, a file), so on Python 3.7.3 on Windows 10 via PowerShell, I'm doing this:

import sys

if not sys.stdout.isatty():
    sys.stdout.reconfigure(encoding='utf-8')

print("Mamma mia.")

When run as encodingtest.py > test.txt, test.txt then turns out to be this:

00000000  FF FE 4D 00 61 00 6D 00 6D 00 61 00 20 00 6D 00  ÿþM.a.m.m.a. .m.
00000010  69 00 61 00 2E 00 0D 00 0A 00                    i.a.......

Mysteriously enough, it starts with FF FE, which is the byte order mark (BOM) for UTF-16-LE, and null bytes appear between the characters (as UTF-16 would have it)! However, when I run it via CMD rather than PowerShell, it prints UTF-8 just fine. How do I get Python to print UTF-8 even when piped via PowerShell?
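For checking which encoding a redirected file ended up in without reading a hex dump, a small BOM sniffer can help. This is only a sketch: the helper name sniff_bom is made up for illustration, and a missing BOM does not prove anything about the encoding.

```python
import codecs

# Longer BOMs come first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# First bytes of the hex dump shown in the question.
sample = bytes.fromhex("FF FE 4D 00 61 00 6D 00")
print(sniff_bom(sample))             # utf-16-le
print(sniff_bom(b"Mamma mia.\r\n"))  # None
```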

I could run encodingtest.py | Out-File -Encoding UTF8 test.txt instead, but is there a way to ensure the output encoding program-side?

Solution

PowerShell fundamentally doesn't support processing raw output (a stream of bytes) from external programs:

  • It invariably decodes such output as text, using the character encoding stored in [Console]::OutputEncoding.

  • Once decoded, it uses its default character encoding for file-output operations such as > (effectively an alias for the Out-File cmdlet), which for > is:

    • Windows PowerShell (up to v5.1): "Unicode", i.e. UTF-16LE (which is what you're seeing)
    • PowerShell (Core, v6+): BOM-less UTF-8 (now applied consistently across all cmdlets, unlike in Windows PowerShell).

In other words: Even use of just > involves a character decoding and re-encoding cycle, with no relationship between the original and the resulting encoding.
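The decode and re-encode cycle can be modelled in a few lines of Python. This is a simplified sketch, not what PowerShell actually executes: the function name simulate_redirect is invented, cp437 is only an assumed example of what [Console]::OutputEncoding might hold, and the model ignores that PowerShell splits the output into lines and re-adds the newlines.

```python
import codecs


def simulate_redirect(raw, console_encoding="cp437"):
    """Model Windows PowerShell's > for the output of an external program."""
    # Step 1: the bytes are decoded as text using the console encoding
    # (cp437 is an assumed example value, not a fixed default).
    decoded = raw.decode(console_encoding)
    # Step 2: the resulting string is re-encoded as UTF-16LE with a BOM.
    return codecs.BOM_UTF16_LE + decoded.encode("utf-16-le")


# What the script emits on Windows: UTF-8 text with "\n" translated to "\r\n".
raw = "Mamma mia.\r\n".encode("utf-8")
written = simulate_redirect(raw)
print(" ".join("{:02X}".format(b) for b in written))
# FF FE 4D 00 61 00 6D 00 6D 00 61 00 20 00 6D 00 69 00 61 00 2E 00 0D 00 0A 00

# With non-ASCII text, the mismatch in step 1 already garbles the data.
garbled = "\u00e9".encode("utf-8").decode("cp437")
print(garbled == "\u00e9")  # False
```

The printed bytes match the hex dump in the question, which is why reconfiguring sys.stdout inside the script cannot change the encoding of the file.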


Therefore:

  • (Temporarily) set [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

  • Pipe the output from your Python script call to Out-File or, preferably, to Set-Content with -Encoding utf8. Set-Content is the better choice whenever the input is known to be strings already, which is always true for external-program calls.

    • Caveat: In Windows PowerShell, you'll invariably get a UTF-8 file with a BOM (see this answer for a workaround). In PowerShell (Core), you'll get one without a BOM (as you would by default), but can opt to create one with -Encoding utf8BOM.
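If the file is later read back in Python, the BOM difference noted in the caveat can be absorbed on the reading side. The sketch below assumes the file is UTF-8 either way; read_utf8 is a hypothetical helper name.

```python
import os
import tempfile


def read_utf8(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    # "utf-8-sig" also decodes BOM-less UTF-8; it only skips a BOM it finds.
    # newline="" keeps the original line endings untouched.
    with open(path, encoding="utf-8-sig", newline="") as f:
        return f.read()


# Demo with two throwaway files: one with a BOM, one without.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.txt")
    without_bom = os.path.join(tmp, "without_bom.txt")
    with open(with_bom, "wb") as f:
        f.write(b"\xef\xbb\xbfMamma mia.\r\n")
    with open(without_bom, "wb") as f:
        f.write(b"Mamma mia.\r\n")
    print(read_utf8(with_bom) == read_utf8(without_bom))  # True
```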

To put it all together (saving and restoring the original [Console]::OutputEncoding not shown):

[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
encodingtest.py | Set-Content -Encoding utf8 test.txt

Modifying [Console]::OutputEncoding isn't necessary if you've switched to UTF-8 system-wide, as described in this answer, but note that this Windows 10 feature is still in beta as of this writing and has far-reaching consequences.


Alternatively, call via cmd.exe, which does pass the raw bytes through to a file with >:

cmd /c 'encodingtest.py > test.txt'

This technique (which analogously applies to Unix-like platforms via /bin/sh -c) is the general workaround for the lack of raw byte processing (see below).
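To illustrate what passing the raw bytes through means, the sketch below uses Python's subprocess module in the role of the shell. The child's stdout is attached directly to the file, so no decoding step sits in between. The inline child script and the name run_redirected are illustrative assumptions; the child adds a non-ASCII character to make the encoding visible.

```python
import os
import subprocess
import sys
import tempfile

# Child script equivalent to encodingtest.py, plus one non-ASCII character.
CHILD = (
    "import sys\n"
    "sys.stdout.reconfigure(encoding='utf-8')\n"
    "print('Mamma mia. \\u00e9')\n"
)


def run_redirected(path):
    """Run the child with its stdout attached directly to the file at path."""
    with open(path, "wb") as out:
        subprocess.run([sys.executable, "-c", CHILD], stdout=out, check=True)
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    data = run_redirected(os.path.join(tmp, "test.txt"))
    # The line ending depends on the platform, so it is stripped before comparing.
    print(data.rstrip(b"\r\n") == "Mamma mia. \u00e9".encode("utf-8"))  # True
```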


Background information: Lack of support for raw byte streams in PowerShell's pipeline:

PowerShell's pipeline is object-based, which means that it is instances of .NET types that flow through it. This evolution of the traditional, binary-only pipeline is the key to PowerShell's power and versatility.

Everything in PowerShell is mediated via pipelines, including use of the redirection operator >, with ... > foo.txt in effect being syntactic sugar for ... | Out-File foo.txt.

  • For PowerShell-native commands, which invariably output .NET objects, some form of encoding is necessary to write these objects to a file in a meaningful way (unless the objects are already strings, a raw byte representation wouldn't make sense). PowerShell therefore uses text representations based on its for-display output-formatting system, which, incidentally, is why > with non-string input is generally unsuited to producing files for later programmatic processing.

  • For external programs, PowerShell has chosen to only ever communicate with them via text (strings), which on receiving output involves the inevitable decoding of the raw bytes received into .NET strings, as described above.

  • See this answer for more information.

This lack of support for raw byte streams is problematic: Unless you call the underlying .NET APIs directly to explicitly handle byte streams (which would be quite cumbersome), the cycle of decoding and re-encoding as text:

  • can alter the data, interfering not only with sending byte streams to files, but also with piping data between or to external programs; see this answer for an example.

  • can significantly degrade performance.
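The first point can be shown with a short Python sketch of a text-only round trip. It assumes that undecodable bytes are replaced with U+FFFD, which is a stand-in for whatever fallback the real decoder applies; round_trip is an invented name.

```python
def round_trip(raw, encoding="utf-8"):
    """Decode bytes to text and encode them again, as a text-only pipeline would."""
    # errors="replace" substitutes U+FFFD for bytes that are not valid in the
    # given encoding; this is an assumption about the decoder's fallback.
    return raw.decode(encoding, errors="replace").encode(encoding)


text_bytes = "Mamma mia.".encode("utf-8")
binary_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])  # arbitrary non-text data

print(round_trip(text_bytes) == text_bytes)      # True
print(round_trip(binary_bytes) == binary_bytes)  # False
```

Valid text survives the cycle when both encodings agree, while arbitrary binary data does not, which is why piping binary output between external programs through a text-only pipeline is unsafe.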

Historically, when PowerShell was a Windows-only shell, this wasn't much of a problem, because the Windows world didn't have many capable CLIs (command-line utilities) worth calling, so staying within the realm of PowerShell was usually sufficient (performance problems notwithstanding).

In an increasingly cross-platform world, however, and especially on Unix-like platforms, capable CLIs abound and are sometimes indispensable for high-performance operations.

Therefore, PowerShell should support raw byte streams at least on demand, and situationally even automatically when detecting that data is being piped between two external programs. See GitHub issue #1908 and GitHub issue #5974.
