What is the encoding of argv?


Problem description


It's not clear to me what encodings are used where in C's argv. In particular, I'm interested in the following scenario:

  • A user uses locale L1 to create a file whose name, N, contains non-ASCII characters
  • Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument

What sequence of bytes does P see on the command line?

I have observed that on Linux, creating a filename in the UTF-8 locale and then tab-completing it in (e.g.) the zh_TW.big5 locale seems to cause my program P to be fed UTF-8 rather than Big5. However, on OS X the same series of actions results in my program P getting a Big5 encoded filename.

Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):

Windows

File names are stored on disk in some Unicode format. So Windows takes the name N, converts from L1 (the current code page) to a Unicode version of N we will call N1, and stores N1 on disk.

What I then assume happens is that when tab-completing later on, the name N1 is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name N -- but this won't be true if N contained characters unrepresentable in L2. We call the new name N2.

When the user actually presses enter to run P with that argument, the name N2 is converted back into Unicode, yielding N1 again. This N1 is now available to the program in UTF-16 format via GetCommandLineW/wmain/_tmain, but users of GetCommandLine/main will see the name N2 in the current locale (code page).

OS X

The disk-storage story is the same, as far as I know. OS X stores file names as Unicode.

With a Unicode terminal, I think what happens is that the terminal builds the command line in a Unicode buffer. So when you tab complete, it copies the file name as a Unicode file name to that buffer.

When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via argv, and the program can decode argv with the current locale into Unicode for display.

Linux

On Linux, everything is different and I'm extra-confused about what is going on. Linux stores file names as byte strings, not in Unicode. So if you create a file with name N in locale L1, that N as a byte string is what is stored on disk.

When I later run the terminal and try and tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file as a byte string is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.

When you run a program, I think that buffer is sent directly to argv. Now, what encoding does argv have? It looks like any characters you typed in the command line while in locale L2 will be in the L2 encoding, but the file name will be in the L1 encoding. So argv contains a mixture of two encodings!

Question

I'd really like it if someone could let me know what is going on here. All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for argv to be encoded in the current code page (Windows) or the current locale (Linux / OS X) but that doesn't seem to be the case...

Extras

Here is a simple candidate program P that lets you observe encodings for yourself:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
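    for (char *c = argv[1]; *c; c++, len++) {
        /* Cast through unsigned char so bytes >= 0x80 print as 128..255
           rather than as negative values where char is signed. */
        printf("%d ", (unsigned char)*c);
    }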

    printf("\nLength: %d\n", len);

    return 0;
}

You can use locale -a to see available locales, and use export LC_ALL=my_encoding to change your locale.

Solution

Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things that have resolved my question:

  1. As discussed, on Windows argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Use of argv is not recommended for modern Windows apps with Unicode support, because code pages are deprecated.

  2. On Unixes, the argv has no fixed encoding:

    a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.

    b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use the locale settings to decide how to encode IME input, whereas OS X uses the Terminal.app encoding preference.)

This is annoying for languages such as Python, Haskell or Java, which want to treat command line arguments as strings. They need to decide how to decode argv into whatever representation is used internally for a String (UTF-16 code units in Java, sequences of Unicode code points in Python 3 and Haskell). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.
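The decoding failure described above can be checked from C. This is a minimal sketch, assuming the POSIX behaviour of mbstowcs (a NULL destination means "validate and count only"); the helper name decodes_in_current_locale is made up for illustration.

```c
#include <stdlib.h>

/* Returns 1 if arg is a valid multibyte string in the current LC_CTYPE
   locale, 0 if it contains a byte sequence that the locale cannot decode. */
int decodes_in_current_locale(const char *arg)
{
    /* With a NULL destination, mbstowcs only counts the wide characters
       it would produce, or returns (size_t)-1 on an invalid sequence. */
    return mbstowcs(NULL, arg, 0) != (size_t)-1;
}
```

A program would call setlocale(LC_ALL, "") once at startup so that the user's locale is in effect, then run the check on each argv[i]. In a UTF-8 locale, a file name stored on disk as Latin-1 bytes should be reported as undecodable.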

The solution to this problem adopted by Python 3 is the surrogateescape error handler (PEP 383, http://www.python.org/dev/peps/pep-0383/), which represents each undecodable byte in argv as a lone low-surrogate code point in the range U+DC80..U+DCFF. When such a code point is encoded back to a byte stream, it becomes the original byte again. This allows round-tripping data from argv that is not valid in the current encoding (e.g. a file name created under a different locale) through the native Python string type and back to bytes with no loss of information.
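The round trip can be sketched in C. This illustrates the idea from PEP 383 and is not Python's actual implementation: it assumes UTF-8 as the locale encoding, and the function names utf8_decode_one, escape_decode and escape_encode are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s (n >= 1 bytes available). Returns its
   length (1-4) and stores the code point in *cp, or returns 0 if the
   bytes are not well-formed UTF-8. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len;
    uint32_t c;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;               /* stray continuation or invalid lead byte */

    if (len > n) return 0;       /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min_cp[len] || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        return 0;
    *cp = c;
    return len;
}

/* Bytes -> code points. Each undecodable byte b becomes U+DC00 + b.
   out must have room for n code points. Returns the number written. */
size_t escape_decode(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ) {
        uint32_t cp;
        size_t len = utf8_decode_one(in + i, n - i, &cp);
        if (len == 0) { cp = 0xDC00u + in[i]; len = 1; }
        out[count++] = cp;
        i += len;
    }
    return count;
}

/* Code points -> bytes. Escaped code points (U+DC80..U+DCFF) turn back
   into the original raw byte. out must have room for 4 * n bytes. */
size_t escape_encode(const uint32_t *in, size_t n, unsigned char *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = in[i];
        if (c >= 0xDC80 && c <= 0xDCFF) {
            out[k++] = (unsigned char)(c - 0xDC00);
        } else if (c < 0x80) {
            out[k++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[k++] = (unsigned char)(0xC0 | (c >> 6));
            out[k++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[k++] = (unsigned char)(0xE0 | (c >> 12));
            out[k++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[k++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[k++] = (unsigned char)(0xF0 | (c >> 18));
            out[k++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[k++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[k++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return k;
}
```

For example, the bytes 0x61 0xE4 0x62 (a Latin-1 "ä" between two ASCII letters, which is not valid UTF-8) decode to the code points U+0061, U+DCE4, U+0062, and encoding those yields the same three bytes again.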

As you can see, the situation is pretty messy :-)
