What is the encoding of argv?


Question

It's not clear to me what encodings are used where in C's argv. In particular, I'm interested in the following scenario:

  • A user uses locale L1 to create a file whose name, N, contains non-ASCII characters
  • Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument

What sequence of bytes does P see on the command line?

I have observed that on Linux, creating a filename in a UTF-8 locale and then tab-completing it in (e.g.) the zh_TW.big5 locale seems to cause my program P to be fed UTF-8 rather than Big5. However, on OS X the same series of actions results in my program P getting a Big5-encoded filename.

Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):

File names are stored on disk in some Unicode format. So Windows takes the name N, converts from L1 (the current code page) to a Unicode version of N we will call N1, and stores N1 on disk.

What I then assume happens is that when tab-completing later on, the name N1 is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name N -- but this won't be true if N contained characters unrepresentable in L2. We call the new name N2.

When the user actually presses enter to run P with that argument, the name N2 is converted back into Unicode, yielding N1 again. This N1 is now available to the program in UCS2 format via GetCommandLineW/wmain/tmain, but users of GetCommandLine/main will see the name N2 in the current locale (code page).

On OS X, the disk-storage story is the same as on Windows, as far as I know: OS X stores file names as Unicode.

With a Unicode terminal, I think what happens is that the terminal builds the command line in a Unicode buffer. So when you tab complete, it copies the file name as a Unicode file name to that buffer.

When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via argv, and the program can decode argv with the current locale into Unicode for display.

On Linux, everything is different and I'm extra-confused about what is going on. Linux stores file names as byte strings, not in Unicode. So if you create a file with name N in locale L1, that N, as a byte string, is what is stored on disk.

When I later run the terminal and try and tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file as a byte string is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.

When you run a program, I think that buffer is sent directly to argv. Now, what encoding does argv have? It looks like any characters you typed in the command line while in locale L2 will be in the L2 encoding, but the file name will be in the L1 encoding. So argv contains a mixture of two encodings!

I'd really like it if someone could let me know what is going on here. All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for argv to be encoded in the current code page (Windows) or the current locale (Linux / OS X) but that doesn't seem to be the case...

Here is a simple candidate program P that lets you observe encodings for yourself:

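#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    /* Print each byte of argv[1] as a decimal number. Where char is
       signed, bytes above 127 print as negative values. */
    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        printf("%d ", (int)(*c));
    }

    printf("\nLength: %d\n", len);

    return 0;
}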

You can use locale -a to see available locales, and use export LC_ALL=my_encoding to change your locale.

Recommended answer

Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things, which have resolved my question:

  1. As discussed, on Windows argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Use of argv is not recommended for modern Windows apps with Unicode support because code pages are deprecated.

  2. On Unixes, argv has no fixed encoding:

a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.

b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use the locale settings to decide how to encode IME input, whereas OS X uses the Terminal.app encoding preference.)

This is annoying for languages such as Python, Haskell or Java, which want to treat command line arguments as strings. They need to decide how to decode argv into whatever representation is used internally for a String (UTF-16 code units in Java, Unicode code points in Haskell and Python 3). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.

The solution to this problem adopted by Python 3 is a surrogate-byte encoding scheme (http://www.python.org/dev/peps/pep-0383/) which represents each undecodable byte in argv as a special Unicode code point (a lone surrogate in the range U+DC80 to U+DCFF). When that code point is encoded back to a byte stream, it just becomes the original byte again. This allows data from argv that is not valid in the current encoding (i.e. a filename created under an encoding other than that of the current locale) to round-trip through the native Python string type and back to bytes with no loss of information.

As you can see, the situation is pretty messy :-)
