Text Encoding between Linux and Windows


Question



    The main question I have is how can I get a textfile that I have in Linux to display properly in PowerShell.

    In Linux, I have text files with some special characters, and in fact Notepad displays the text file exactly as it is displayed in Linux:

    Unfortunately, my program prints to my Linux Terminal, and thus I need the same output in my Windows terminal. I have seen through other answers that

    1. I need to use a TrueType font, so I am using Lucida Console
    2. on my Linux device, the encoding is UTF-8. According to every answer I can find online, CHCP 65001 switches the code page in PowerShell to UTF-8
    3. Windows PowerShell is better equipped to display content, so while I have tried using the command prompt, I am now working in PowerShell.

    Using CHCP 65001 and then typing

    more my_file.txt
    

    displays this:

    while using

    Get-Content -Encoding UTF8 my_file.txt
    

    outputs:

    Neither of these results is good enough, but I am actually concerned that Get-Content does something different at all here. The code that I am transferring to Windows is written in Free Pascal, and in Free Pascal, I can provide a UTF-8 codepage, but that's it. So while Get-Content is a good command for me to check whether PowerShell is capable of producing the desired output, it is not practical for me to use it. In Pascal, the output (which is written to the PowerShell display) appears as:

    Which is bad as well, those lines should connect because they do in Linux (and obviously some characters are interpreted just as ?). However, this might be a problem with the codepage picked in Pascal, which would be a next step.

    My question right now is: how can I get Windows PowerShell to display, by default, a text file as it is shown in Notepad? It is not practical for me to run Get-Content everywhere in my code, so although that result appears more promising, I cannot follow it.

    As a follow up question, because I could not find it anywhere online, what are the main players here when it comes to displaying content, because it is clearly a bigger story than just the encoding. Why are the 'more' and the 'Get-Content' commands displaying different outputs? And why can 'Get-Content' not read all of the content? I had assumed UTF-8 was a universal standard, and programs who can read UTF-8 could at least actually read all of the characters, but they're all reading it differently.

    The input, as text, is:

        ╭─────╮
        │     │
      ╭─│───╮ │
      │ │   │ │
      │ │ ╭─│───╮
      │ │ │ │ │ │
    ╭─│───│─╯ │ │
    │ │ │ │   │ │
    │ │ ╰─╯   │ │
    │ │       │ │
    │ ╰───────│─╯
    │         │
    ╰─────────╯
    
    
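(Editorial aside: the rounded-corner characters in the drawing above all lie outside ASCII and take three bytes each in UTF-8, which is why every tool in the chain must agree on the encoding. A quick byte-level check, in Python purely for illustration:)

```python
# Code points and UTF-8 byte lengths of the box-drawing characters used above.
for ch in "╭╮╰╯─│":
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {len(encoded)} bytes: {encoded.hex(' ')}")
```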

    In response to an answer posted below, I can see that

    more my_file.txt
    

    produces

    when using

    $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = 
      New-Object System.Text.UTF8Encoding 
    

    Solution

    • Make sure that your UTF-8-encoded text file has a BOM - otherwise, your file will be misinterpreted by Windows PowerShell as being encoded based on the system's active ANSI code page (whereas PowerShell [Core] 6+ now thankfully consistently defaults to UTF-8 in the absence of a BOM).

      • Alternatively, use Get-Content -Encoding Utf8 my_file.txt to explicitly specify the file's encoding.

      • For a comprehensive discussion of character encoding in Windows PowerShell vs. PowerShell [Core], see this answer.

    • For output from external programs to be correctly captured in a variable or correctly redirected to a file, you need to set [Console]::OutputEncoding to the character encoding that the given program uses on output (for merely printing to the display this may not be necessary, however):

      • If code page 65001 (UTF-8) is in effect and your program honors that, you'll need to set [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding; see below for how to ensure that 65001 is truly in effect, given that running chcp 65001 from inside PowerShell is not effective.

      • You mention FreePascal, whose Unicode support is described here.
        However, your screen shot implies that your FreePascal program's output is not UTF-8, because the rounded-corner characters were transcoded to ? characters (which suggests a lossy transcoding to the system's OEM code page, where these characters aren't present).

      • Therefore, to solve your problem you must ensure that your FreePascal program either unconditionally outputs UTF-8 or honors the active code page (as reported by chcp), assuming you've first set it to 65001 (the UTF-8 code page; see below).

    • Choose a font that can render the rounded-corner Unicode characters such as ╭: the font you are using apparently cannot (they render as ?, as shown in your question), but Consolas, for instance (which PowerShell [Core] 6+ uses by default), can.
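The first point above (the BOM requirement) is easy to verify outside PowerShell. In this illustrative Python sketch, "utf-8-sig" is Python's name for UTF-8 with a BOM; the three-byte marker it prepends is exactly what Windows PowerShell looks for:

```python
text = "╭─────╮"

with_bom = text.encode("utf-8-sig")   # prepends the BOM: EF BB BF
without_bom = text.encode("utf-8")    # no marker at all

print(with_bom[:3])                   # the BOM Windows PowerShell detects
print(with_bom[3:] == without_bom)    # the payload bytes are identical

# Without a BOM, Windows PowerShell falls back to the ANSI code page;
# decoding the UTF-8 bytes as e.g. Windows-1252 yields mojibake:
print(without_bom.decode("cp1252", errors="replace"))
```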


    Using UTF-8 encoding with external programs consistently:

    Note:

    • The command below is neither necessary for nor does it have any effect on PowerShell commands such as the Get-Content cmdlet.

    • Some legacy console applications - notably more.com (which Windows PowerShell wraps in a more function) - fundamentally do not support Unicode, only the legacy OEM code pages.[*]

    According to every answer I can find online, CHCP 65001 switches the code page in PowerShell to UTF-8

    chcp 65001 does not work if run from within PowerShell, because .NET caches the [Console]::OutputEncoding value at PowerShell session startup, with the code page that was in effect at that time.

    Instead, you can use the following to fully make a console window UTF-8 aware (which implicitly also makes chcp report 65001 afterwards):

    $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding =
                        New-Object System.Text.UTF8Encoding
    

    This makes PowerShell interpret an external program's output as UTF-8, and also encodes the data it sends to external programs as UTF-8 (thanks to the preference variable $OutputEncoding).

    See this answer for more information.


    [*] With the UTF-8 code page 65001 in effect, more quietly skips lines that contain at least one Unicode character that cannot be mapped onto the system's OEM code page (any character not present in the system's single-byte OEM code page, which can only represent 256 characters), which in this case applies to the lines that contain the rounded-corner characters such as ╭ (BOX DRAWINGS LIGHT ARC DOWN AND RIGHT, U+256D).
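The footnote's lossy-mapping point can be reproduced directly. Assuming code page 437 as the OEM code page (a common default on US systems; yours may differ), Python shows that U+256D simply has no single-byte representation, so a lossy transcode substitutes ? -- matching the output seen in the question:

```python
ch = "\u256d"   # ╭  BOX DRAWINGS LIGHT ARC DOWN AND RIGHT

# cp437 contains the square box-drawing corners, but not the rounded arcs.
try:
    ch.encode("cp437")
except UnicodeEncodeError as err:
    print("not representable in cp437:", err.reason)

# A lossy transcode substitutes '?':
print(ch.encode("cp437", errors="replace"))

# The square corner U+250C, by contrast, does exist in cp437 (byte 0xDA):
print("\u250c".encode("cp437"))
```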
