How can I redirect input in PowerShell without a BOM?


Question

I am trying to redirect input in PowerShell with:

Get-Content input.txt | my-program args

The problem is that the piped UTF-8 text is preceded by a BOM (0xEFBBBF), and my program cannot handle that correctly.

A minimal working example:

// File: Hex.java
import java.io.IOException;

public class Hex {
    public static void main(String[] dummy) {
        int ch;
        try {
            while ((ch = System.in.read()) != -1) {
                System.out.print(String.format("%02X ", ch));
            }
        } catch (IOException e) {
        }
    }
}

Then in PowerShell:

javac Hex.java
Set-Content textfile "ABC" -Encoding Ascii
# Now the content of textfile is 0x41 42 43 0D 0A
Get-Content textfile | java Hex

Or simply:

javac Hex.java
Write-Output "ABC" | java Hex

In either case, the output is EF BB BF 41 42 43 0D 0A.

How can I pipe the text into the program without the leading 0xEFBBBF?

Recommended answer

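Note: The following contains general information that, in a normally functioning PowerShell environment, would explain the OP's symptom. That the solution doesn't work in the OP's case is due to machine-specific causes that are unknown at this point.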

To ensure that your Java program receives its input UTF-8-encoded without a BOM, you must set $OutputEncoding to a System.Text.UTF8Encoding instance that does not emit a BOM:

# Assigns UTF-8 encoding *without a BOM*.
# PowerShell uses this encoding to encode data piped to external programs.
# $OutputEncoding defaults to ASCII(!) in Windows PowerShell, and more sensibly
# to BOM-*less* UTF-8 in PowerShell [Core] v6+
$OutputEncoding = [Text.UTF8Encoding]::new($false)

Caveat: Do NOT use the seemingly equivalent New-Object Text.Utf8Encoding $false, because, due to the bug described in this GitHub issue, it won't work if you assign to $OutputEncoding in a non-global scope, such as in a script.

If, by contrast, you use [Text.Encoding]::Utf8 (System.Text.Encoding.UTF8), you will get a BOM - which is what I suspect happened in your case.

Note that this problem is unrelated to the source encoding of any file read by Get-Content, because what is sent through the PowerShell pipeline is never a stream of raw bytes, but .NET objects. In the case of Get-Content, that means .NET strings are sent (System.String, internally a sequence of UTF-16 code units).

Because you're piping to an external program (a Java application, in your case), PowerShell character-encodes the (stringified-on-demand) objects sent to it based on the preference variable $OutputEncoding, and the resulting encoding is what the external program receives.

Perhaps surprisingly, even though BOMs are typically only used in files, PowerShell also respects the BOM setting of the encoding assigned to $OutputEncoding in the pipeline, prepending the BOM to the first line sent (only).

See the bottom section of this answer for more information about how PowerShell handles pipeline input for and output from external programs, including how it is [Console]::OutputEncoding that matters when PowerShell interprets data received from external programs.

To illustrate the difference using your sample program (note that a PowerShell string literal is sufficient as input; there is no need to read from a file):

# Note the EF BB BF sequence representing the UTF-8 BOM.
# Enclosure in & { ... } ensures that a local, temporary copy of $OutputEncoding
# is used.
PS> & { $OutputEncoding = [Text.Encoding]::Utf8; 'hö' | java Hex }
EF BB BF 68 C3 B6 0D 0A

# Note the absence of EF BB BF, due to using a BOM-less
# UTF-8 encoding.
PS> & { $OutputEncoding = [Text.Utf8Encoding]::new($false); 'hö' | java Hex }
68 C3 B6 0D 0A
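
The two byte sequences above can also be reproduced without PowerShell in the loop, which helps confirm that the extra three bytes really are the UTF-8 BOM rather than something the Java side adds. The following is only a sketch: the class name BomBytes and its helpers are made up for illustration, and it merely imitates what a BOM-emitting versus a BOM-less UTF-8 encoder would produce for the line 'hö' followed by CRLF.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class (not part of the original question or answer).
public class BomBytes {
    // The UTF-8 BOM is U+FEFF encoded as UTF-8: EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Formats bytes the same way Hex.java does, minus the trailing space.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Encodes text as UTF-8, optionally preceded by the BOM.
    static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (withBom) {
            out.write(UTF8_BOM, 0, UTF8_BOM.length);
        }
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // \u00F6 is 'ö'; the escape avoids depending on the source file's encoding.
        String line = "h\u00F6\r\n";
        System.out.println(toHex(encode(line, true)));   // EF BB BF 68 C3 B6 0D 0A
        System.out.println(toHex(encode(line, false)));  // 68 C3 B6 0D 0A
    }
}
```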

In Windows PowerShell, where $OutputEncoding defaults to ASCII(!), you'd see the following with the default in place:

# The default of ASCII(!) results in *lossy* encoding in Windows PowerShell.
PS> 'hö' | java Hex 
68 3F 0D 0A

Note that 3F represents the literal ? character, which is what the non-ASCII ö character was transliterated to, given that it has no representation in ASCII; in other words: information was lost.
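
The same lossy substitution can be observed from the Java side. This is a small sketch, assuming only the standard library; the class name AsciiLoss is hypothetical. Java's String.getBytes(Charset) replaces unmappable characters with the charset's replacement byte, which for US-ASCII is ? (0x3F), so the result matches what Windows PowerShell sends with its default $OutputEncoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not part of the original answer).
public class AsciiLoss {
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Encodes text as US-ASCII; characters outside ASCII become '?' (0x3F).
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00F6 is 'ö', which has no ASCII representation.
        System.out.println(toHex(toAscii("h\u00F6\r\n")));  // 68 3F 0D 0A
    }
}
```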

PowerShell [Core] v6+ now sensibly defaults to BOM-less UTF-8, so the default behavior there is as expected.
While BOM-less UTF-8 is PowerShell [Core]'s consistent default, including for cmdlets that read from and write to files, on Windows [Console]::OutputEncoding still reflects the active OEM code page by default as of v7.0. To correctly capture output from UTF-8-emitting external programs, it must therefore be set to [Text.UTF8Encoding]::new($false) as well - see this GitHub issue.
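
If $OutputEncoding cannot be controlled (for instance, when the program is launched from scripts you don't own), the receiving program can also tolerate a leading BOM itself. The following variant of the question's Hex program is only a sketch of that fallback and is not part of the original answer; the class name HexNoBom and its helper methods are hypothetical. For simplicity it buffers all of standard input before inspecting the first three bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical variant of Hex.java that ignores a leading UTF-8 BOM.
public class HexNoBom {
    // True if the data starts with EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Returns the data without a leading UTF-8 BOM; other input is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        // Read all of stdin into memory (fine for a sketch, not for huge inputs).
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = System.in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        for (byte b : stripUtf8Bom(buffer.toByteArray())) {
            System.out.print(String.format("%02X ", b & 0xFF));
        }
    }
}
```

Only a BOM at the very start of the input is removed; an EF BB BF sequence anywhere else is passed through untouched.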
