Text Encoding between Linux and Windows
Problem description
The main question I have is how I can get a text file that I have in Linux to display properly in PowerShell.
In Linux, I have text files with some special characters, and in fact Notepad displays the text file exactly as it is displayed in Linux:
Unfortunately, my program prints to my Linux Terminal, and thus I need the same output in my Windows terminal. I have seen through other answers that
- I need to use a TrueType font, so I am using Lucida Console
- On my Linux device, the encoding is UTF-8. According to every answer I can find online, CHCP 65001 switches the code page in PowerShell to UTF-8
- Windows PowerShell is better equipped to display content, so while I have tried using the command prompt, I am now working in PowerShell.
Using CHCP 65001 and then typing
more my_file.txt
displays this:
while using
Get-Content -Encoding UTF8 my_file.txt
outputs:
Neither of these results is good enough, but I am actually concerned that Get-Content does something different at all here. The code that I am transferring to Windows is written in Free Pascal, and in Free Pascal, I can provide a UTF-8 codepage, but that's it. So while Get-Content is a good command for me to check whether PowerShell is capable of producing the desired output, it is not practical for me to use it. In Pascal, the output (which is written to the PowerShell display) appears as:
That is bad as well: those lines should connect, because they do in Linux (and obviously some characters are rendered just as ?). However, this might be a problem with the codepage picked in Pascal, which would be a next step.
My question right now is: how can I get Windows PowerShell to display a text file, by default, as it is shown in Notepad? It is not practical for me to run Get-Content everywhere in my code, so although that result appears more promising, I cannot follow that route.
As a follow-up question, because I could not find it anywhere online: what are the main players here when it comes to displaying content? It is clearly a bigger story than just the encoding. Why do the 'more' and 'Get-Content' commands display different outputs? And why can 'Get-Content' not read all of the content? I had assumed UTF-8 was a universal standard, and that programs that can read UTF-8 could at least read all of the characters, but they're all reading it differently.
The input, as text, is:
╭─────╮ │ │ ╭─│───╮ │ │ │ │ │ │ │ ╭─│───╮ │ │ │ │ │ │ ╭─│───│─╯ │ │ │ │ │ │ │ │ │ │ ╰─╯ │ │ │ │ │ │ │ ╰───────│─╯ │ │ ╰─────────╯
In response to an answer posted below, I can see that
more my_file.txt
produces
when using
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
Solution
- Make sure that your UTF-8-encoded text file has a BOM; otherwise, your file will be misinterpreted by Windows PowerShell as being encoded based on the system's active ANSI code page (whereas PowerShell [Core] 6+ now thankfully consistently defaults to UTF-8 in the absence of a BOM).
- Alternatively, use Get-Content -Encoding Utf8 my_file.txt to explicitly specify the file's encoding.
For a comprehensive discussion of character encoding in Windows PowerShell vs. PowerShell [Core], see this answer.
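To make the BOM point concrete, here is a minimal sketch in Python (used purely for illustration; it is not part of the original answer). It assumes Windows-1252 as the system's ANSI code page, which in practice depends on the locale, and it uses a made-up sample string.

```python
import codecs

# Hypothetical sample: a short run of box-drawing characters like those in the question.
text = "╭──╮"

# UTF-8 without a BOM: nothing in the bytes announces the encoding.
raw = text.encode("utf-8")

# What a reader sees if it assumes the ANSI code page instead.
# Windows-1252 is an assumption; the real ANSI code page depends on the system locale.
misread = raw.decode("cp1252", errors="replace")

# UTF-8 with a BOM: the bytes EF BB BF are written first.
with_bom = text.encode("utf-8-sig")

print(misread == text)                       # does the ANSI reading round-trip?
print(with_bom.startswith(codecs.BOM_UTF8))  # is the BOM present?
print(with_bom.decode("utf-8-sig") == text)  # does BOM-aware decoding recover the text?
```

The same bytes yield different text depending on which encoding the reader assumes; the BOM simply removes the guesswork for readers that look for it.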
For output from external programs to be correctly captured in a variable or correctly redirected to a file, you need to set [Console]::OutputEncoding to the character encoding that the given program uses on output (for mere printing to the display this may not be necessary, however):
If code page 65001 (UTF-8) is in effect and your program honors that, you'll need to set
[Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
See below for how to ensure that 65001 is truly in effect, given that running chcp 65001 from inside PowerShell is not effective.
You mention FreePascal, whose Unicode support is described here.
However, your screenshot implies that your FreePascal program's output is not UTF-8, because the rounded-corner characters were transcoded to ? characters (which suggests a lossy transcoding to the system's OEM code page, where these characters aren't present).
Therefore, to solve your problem you must ensure that your FreePascal program either unconditionally outputs UTF-8 or honors the active code page (as reported by chcp), assuming you've first set it to 65001 (the UTF-8 code page; see below).
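The lossy transcoding described above can be reproduced with a small Python sketch (for illustration only). It assumes code page 437 as the system's OEM code page; other systems use a different one (850, for example), and the sample line is hypothetical.

```python
# Assumption: code page 437 stands in for the system's OEM code page.
OEM = "cp437"

# Hypothetical sample mixing straight and rounded box-drawing characters.
line = "╭─│─╮"

# Lossy transcoding: characters missing from the OEM code page are replaced with "?".
oem_bytes = line.encode(OEM, errors="replace")
shown = oem_bytes.decode(OEM)

print(shown)
```

The straight segments (─ and │) exist in code page 437 and survive, while the rounded corners do not, which matches the pattern of ? characters between otherwise intact lines.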
Choose a font that can render the rounded-corner Unicode characters (such as ╭, U+256D). Not every console font can (as shown in your question), but Consolas, for instance (which PowerShell [Core] 6+ uses by default), can.
Using UTF-8 encoding with external programs consistently:
Note:
- The command below is neither necessary for nor does it have any effect on PowerShell commands such as the Get-Content cmdlet.
- Some legacy console applications, notably more.com (which Windows PowerShell wraps in a more function), fundamentally do not support Unicode, only the legacy OEM code pages. [*]
According to every answer I can find online, CHCP 65001 switches the code page in PowerShell to UTF-8
chcp 65001 does not work if run from within PowerShell, because .NET caches the [Console]::OutputEncoding value at PowerShell session startup, with the code page that was in effect at that time.
Instead, you can use the following to make a console window fully UTF-8 aware (which implicitly also makes chcp report 65001 afterwards):
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
This makes PowerShell interpret an external program's output as UTF-8, and also encode data it sends to external programs as UTF-8 (thanks to the preference variable $OutputEncoding).
See this answer for more information.
[*] With the UTF-8 code page 65001 in effect, more quietly skips lines that contain at least one Unicode character that cannot be mapped onto the system's OEM code page (that is, any character not present in the system's single-byte OEM code page, which can represent only 256 characters). In this case, that applies to the lines that contain rounded-corner characters such as ╭ (BOX DRAWINGS LIGHT ARC DOWN AND RIGHT, U+256D).
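As a rough model of the line-skipping behavior described in this footnote, the Python sketch below keeps only the lines that can be fully represented in the OEM code page. It is not how more.com is implemented; code page 437 and the three-line sample are assumptions made for illustration.

```python
# Assumption: code page 437 stands in for the system's OEM code page.
OEM = "cp437"


def oem_representable(line: str) -> bool:
    """Return True if every character in the line exists in the OEM code page."""
    try:
        line.encode(OEM)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical three-line box; only the middle line avoids rounded corners.
sample = ["╭───╮", "│   │", "╰───╯"]

# Model of the described behavior: lines with unmappable characters are dropped.
kept = [line for line in sample if oem_representable(line)]

print(kept)
```

Under this model the top and bottom edges of the box vanish entirely, which is consistent with output that appears to be missing lines rather than showing placeholder characters.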