Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

Question

I need to decode the stdout of a PowerShell process called from Python into a Python string.

My ultimate goal is to get the names of the network adapters on Windows as a list of strings. My current function looks like this and works well on Windows 10 with the English language:

import subprocess

def get_interfaces():
    ps = subprocess.Popen(['powershell', 'Get-NetAdapter', '|', 'select Name', '|', 'fl'], stdout=subprocess.PIPE)
    stdout, _ = ps.communicate(timeout=10)  # stderr is not captured, so it is None
    interfaces = []
    for line in stdout.split(b'\r\n'):
        if b':' not in line:  # this also skips blank lines
            continue
        # Split on the first colon only: an adapter name may itself contain ':'
        name, value = [part.strip() for part in line.split(b':', 1)]
        if name == b'Name':
            interfaces.append(value.decode('ascii')) # This fails for other users
    return interfaces

Other users run Windows in different languages, so value.decode('ascii') fails for some of them. For example, one user reported that changing it to decode('ISO 8859-2') works well for him (so it is not UTF-8). How can I know which encoding to use to decode the stdout bytes returned by the call to PowerShell?

Update

After some experiments I am even more confused. The code page in my console, as returned by chcp, is 437. I changed the network adapter name to one containing non-ASCII and non-cp437 characters. An interactive PowerShell session running Get-NetAdapter | select Name | fl displayed the name correctly, even its non-cp437 characters. When I called PowerShell from Python, the non-ASCII characters were converted to the closest ASCII characters (for example, ā to a, ž to z) and .decode('ascii') worked nicely. Could this behaviour (and correspondingly the solution) be Windows version dependent? I am on Windows 10, but users could be on older versions of Windows, down to Windows 7.

Answer

The output character encoding may depend on the specific command, e.g.:

#!/usr/bin/env python3
import subprocess
import sys

encoding = 'utf-32'
cmd = r'''$env:PYTHONIOENCODING = "%s"; py -3 -c "print('\u270c')"''' % encoding
data = subprocess.check_output(["powershell", "-C", cmd])
print(sys.stdout.encoding)
print(data)
print(ascii(data.decode(encoding)))

Output

cp437
b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"
'\u270c\r\n'

The ✌ (U+270C) character is received successfully.

The character encoding of the child script is set using the PYTHONIOENCODING environment variable inside the PowerShell session. I've chosen utf-32 as the output encoding so that, for the demonstration, it differs from both the Windows ANSI and OEM code pages.
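The same effect can be reproduced without PowerShell in between, which makes it easier to experiment with. The sketch below is only an illustration: the helper name run_child_with_encoding is hypothetical, and it assumes the child is the same Python interpreter as the parent (sys.executable) rather than the py launcher used above.

```python
import os
import subprocess
import sys


def run_child_with_encoding(encoding):
    """Run a child Python that prints U+270C, forcing its stdout encoding.

    Hypothetical helper for illustration only; PYTHONIOENCODING is passed
    through the child's environment instead of a PowerShell session.
    """
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    # Raw string: the child interpreter, not the parent, expands \u270c.
    code = r"print('\u270c')"
    data = subprocess.check_output([sys.executable, "-c", code], env=env)
    # The parent must decode with the same encoding the child was told to use.
    return data.decode(encoding)


if __name__ == "__main__":
    print(ascii(run_child_with_encoding("utf-32")))
```

The point is that the parent has to know which encoding the child used; nothing in the byte stream itself announces it reliably.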

Notice that the stdout encoding of the parent Python script is the OEM code page (cp437 in this case) -- the script is run from the Windows console. If you redirect the output of the parent Python script to a file or a pipe, then the ANSI code page (e.g., cp1252) is used by default in Python 3.
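To see which encodings apply on a particular machine, something like the following sketch can be used. The function name child_stdout_encoding is hypothetical, and the values printed depend on the platform, the Python version and the environment. In particular, Python 3.6 and later report utf-8 for a Windows console stream (PEP 528), so the cp437 value shown above is specific to older versions.

```python
import locale
import subprocess
import sys


def child_stdout_encoding():
    """Return sys.stdout.encoding as seen by a child Python writing to a pipe.

    Hypothetical helper: the child is the current interpreter, and its
    stdout is redirected because check_output() captures it.
    """
    code = "import sys; print(sys.stdout.encoding)"
    out = subprocess.check_output([sys.executable, "-c", code])
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print("parent stdout:       ", sys.stdout.encoding)
    print("child stdout (piped):", child_stdout_encoding())
    print("locale preferred:    ", locale.getpreferredencoding(False))
```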

To decode PowerShell output that might contain characters undecodable in the current OEM code page, you could set [Console]::OutputEncoding temporarily (inspired by @eryksun's comments):

#!/usr/bin/env python3
import io
import sys
from subprocess import Popen, PIPE

char = ord('✌')
filename = 'U+{char:04x}.txt'.format(**vars())
with Popen(["powershell", "-C", '''
    $old = [Console]::OutputEncoding
    [Console]::OutputEncoding = [Text.Encoding]::UTF8
    echo $([char]0x{char:04x}) | fl
    echo $([char]0x{char:04x}) | tee {filename}
    [Console]::OutputEncoding = $old'''.format(**vars())],
           stdout=PIPE) as process:
    print(sys.stdout.encoding)
    for line in io.TextIOWrapper(process.stdout, encoding='utf-8-sig'):
        print(ascii(line))
print(ascii(open(filename, encoding='utf-16').read()))

Output

cp437
'\u270c\n'
'\u270c\n'
'\u270c\n'

Both fl and tee use [Console]::OutputEncoding for stdout (the default behavior is as if | Write-Output were appended to the pipelines). tee uses utf-16 to save the text to a file. The output shows that ✌ (U+270C) is decoded successfully.
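Applied to the function from the question, the approach might look like the sketch below. It is an untested adaptation, not a verified fix: parse_adapter_names and PS_COMMAND are hypothetical names, and it assumes that PowerShell honours the temporary [Console]::OutputEncoding for Get-NetAdapter in the same way as in the demonstration above.

```python
import subprocess

# Switch the console output encoding to UTF-8 for this one command,
# then restore it, as in the demonstration above.
PS_COMMAND = """
$old = [Console]::OutputEncoding
[Console]::OutputEncoding = [Text.Encoding]::UTF8
Get-NetAdapter | select Name | fl
[Console]::OutputEncoding = $old
"""


def parse_adapter_names(raw):
    """Extract the values of 'Name : ...' lines from Format-List output."""
    # utf-8-sig also accepts input that starts with a UTF-8 BOM.
    text = raw.decode("utf-8-sig")
    names = []
    for line in text.splitlines():
        # Split on the first colon only; a name may contain ':' itself.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Name":
            names.append(value.strip())
    return names


def get_interfaces(timeout=10):
    """Return the network adapter names as a list of str (Windows only)."""
    raw = subprocess.check_output(["powershell", "-C", PS_COMMAND],
                                  timeout=timeout)
    return parse_adapter_names(raw)
```

Keeping the parsing separate from the subprocess call means the decoding logic can be checked with captured bytes on any machine.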

$OutputEncoding is used to decode bytes in the middle of a pipeline:

#!/usr/bin/env python3
import subprocess

cmd = r'''
  $OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
  py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
  py -3 -c "import os; print(os.read(0, 512))"
'''
subprocess.check_call(["powershell", "-C", cmd])

Output

b'\xf0\x9f\x98\x8a\r\n'

That is correct: b'\xf0\x9f\x98\x8a'.decode('utf-8') == u'\U0001f60a'. With the default $OutputEncoding (ascii) we would get b'????\r\n' instead.
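The four question marks can be reproduced in pure Python under one assumption about what PowerShell does between the two commands: decode the bytes with the console code page, then re-encode the text with $OutputEncoding. The reencode helper below is hypothetical and only models that assumption, with cp437 standing in for the default console code page.

```python
def reencode(data, decode_with, encode_with):
    """Decode bytes with one codec and re-encode with another.

    Characters the target codec cannot represent become '?',
    a simplified stand-in for a replacement fallback.
    """
    return data.decode(decode_with).encode(encode_with, errors="replace")


smiley_utf8 = "\U0001f60a".encode("utf-8")  # b'\xf0\x9f\x98\x8a'

# Assumed defaults: cp437 on the way in, ascii on the way out.
mangled = reencode(smiley_utf8, "cp437", "ascii")

# Both sides set to UTF-8, as in the example above.
intact = reencode(smiley_utf8, "utf-8", "utf-8")

print(mangled)  # b'????'
print(intact)   # b'\xf0\x9f\x98\x8a'
```

Each of the four UTF-8 bytes is read as a separate cp437 character, none of which is ASCII, which is why one emoji turns into four question marks rather than one.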

Notes:

  • b'\n' is replaced with b'\r\n' despite using a binary API such as os.read/os.write (msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY) has no effect here)
  • b'\r\n' is appended if there is no newline in the output:

#!/usr/bin/env python3
from subprocess import check_output

cmd = '''py -3 -c "print('no newline in the input', end='')"'''
cat = '''py -3 -c "import os; os.write(1, os.read(0, 512))"'''  # pass as is
piped = check_output(['powershell', '-C', '{cmd} | {cat}'.format(**vars())])
no_pipe = check_output(['powershell', '-C', '{cmd}'.format(**vars())])
print('piped:   {piped}\nno pipe: {no_pipe}'.format(**vars()))

Output:

piped:   b'no newline in the input\r\n'
no pipe: b'no newline in the input'

The newline is appended to the piped output.
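If the exact bytes matter, the caller has to undo this on the Python side. A minimal sketch follows, with the hypothetical helper normalize_ps_output. Note that it is lossy: once PowerShell has added the trailing b'\r\n', there is no way to tell whether the original output ended with a newline or not.

```python
def normalize_ps_output(data):
    """Convert CRLF to LF and drop a single trailing newline, if any."""
    data = data.replace(b"\r\n", b"\n")
    if data.endswith(b"\n"):
        data = data[:-1]
    return data


print(normalize_ps_output(b"no newline in the input\r\n"))
```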

If we ignore lone surrogates, then setting UTF8Encoding allows all Unicode characters, including non-BMP characters, to be passed via pipes. Text mode could be used in Python if $env:PYTHONIOENCODING = "utf-8:ignore" is configured.
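The reason for the ":ignore" part can be seen directly in Python: a lone surrogate cannot be encoded as UTF-8 with the default strict error handler, while a non-BMP character encodes without trouble. The variable names in this small illustration are arbitrary.

```python
# 'a', a lone high surrogate, 'b'
text_with_lone_surrogate = "a\ud83db"

try:
    text_with_lone_surrogate.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    # The strict handler refuses to encode a lone surrogate.
    strict_ok = False

# The 'ignore' handler silently drops what cannot be encoded.
ignored = text_with_lone_surrogate.encode("utf-8", errors="ignore")

# A complete non-BMP character is fine in either mode.
non_bmp = "\U0001f60a".encode("utf-8")

print(strict_ok, ignored, non_bmp)
```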

"In interactive powershell running Get-NetAdapter | select Name | fl displayed correctly the name even its non-cp437 character."

If stdout is not redirected, then the Unicode API is used to print characters to the console -- any [BMP] Unicode character can be displayed if the console (TrueType) font supports it.

"When I called powershell from python non-ascii characters were converted to closest ascii characters (e.g. ā to a, ž to z) and .decode(ascii) worked nicely."

It might be due to System.Text.InternalDecoderBestFitFallback being set for [Console]::OutputEncoding -- if a Unicode character can't be encoded in the given encoding, then it is passed to the fallback (either a best-fit character or '?' is used instead of the original character).
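Python has no direct equivalent of a best-fit fallback, but the two outcomes described above can be imitated. The sketch below is only an analogy: best_fit_ascii is a hypothetical helper that strips combining marks after Unicode decomposition, which is not how .NET's best-fit tables are defined.

```python
import unicodedata


def best_fit_ascii(text):
    """Rough analogue of a best-fit mapping: decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


name = "\u0101\u017e"  # the characters ā and ž from the question

# Neither character exists in cp437, so 'replace' yields question marks.
replaced = name.encode("cp437", errors="replace")

# The decomposition trick keeps the base letters instead.
fitted = best_fit_ascii(name)

print(replaced, fitted)
```

Either way the original characters are lost before Python sees the bytes, which is why .decode('ascii') appears to work in that situation.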

"Could this behavior (and correspondingly solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7."

If we ignore the bugs in cp65001 and the list of new encodings that are supported in later versions, then the behavior should be the same.
