How to detect the character encoding of command line arguments in MinGW


Problem description

Is it safe to assume they are ISO-8859-15 (Windows-1252?), or is there some function I can call to query this? The end goal is to convert them to UTF-8.


Background:

The problem described by this question arises because XMLStarlet assumes its command line arguments are UTF-8. Under Windows it seems they are actually ISO-8859-15 (Windows-1252?), or at least adding the following to the beginning of main makes things work:

char **utf8argv = malloc(sizeof(char*) * (argc+1));
utf8argv[argc] = NULL;

{
    iconv_t windows2utf8 = iconv_open("UTF-8", "ISO-8859-15");
    int i;
    for (i = 0; i < argc; i++) {
        const char *arg = argv[i];
        size_t len = strlen(arg);
        /* the euro sign (0xA4 in ISO-8859-15) takes 3 bytes in UTF-8 */
        size_t outlen = len*3 + 1;
        char *utfarg = malloc(outlen);

        char *out = utfarg;
        size_t ret = iconv(windows2utf8,
            &arg, &len,
            &out, &outlen);

        /* size_t is unsigned, so errors must be detected as (size_t)-1 */
        if (ret == (size_t)-1) {
            perror("iconv");
            free(utfarg);
            utf8argv[i] = NULL;
            continue;
        }

        out[0] = '\0';
        utf8argv[i] = utfarg;
    }
    iconv_close(windows2utf8);

    argv = utf8argv;
}


Testing Encoding

The following program prints out the bytes of its first argument in decimal:

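#include <string.h>   /* strlen is declared here, not in <strings.h> */
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        return 1;   /* nothing to print without a first argument */
    }
    for (size_t i = 0; i < strlen(argv[1]); i++) {
        printf("%d ", (unsigned char) argv[1][i]);
    }
    printf("\n");
    return 0;
}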

chcp reports code page 850, so the characters æ and Æ should be 145 and 146, respectively.

C:\Users\npostavs\tmp>chcp
Active code page: 850

But we see 230 and 198 reported, which matches Windows-1252:

C:\Users\npostavs\tmp>cmd-chars æÆ
230 198
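
To make the byte values above concrete, here is a minimal sketch of what turning those two bytes into UTF-8 involves. The helper name `cp1252_upper_to_utf8` is hypothetical and not part of any library. It only handles ASCII and the range 0xA0-0xFF, where Windows-1252 coincides with ISO-8859-1 and therefore with the Unicode code points of the same value. Bytes 0x80-0x9F need a lookup table and are rejected here, so this is not a substitute for iconv.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated string made of ASCII bytes
 * and Windows-1252 bytes in 0xA0-0xFF to a newly allocated UTF-8 string.
 * Returns NULL on allocation failure or on a byte in 0x80-0x9F, which this
 * sketch deliberately does not handle. The caller frees the result. */
char *cp1252_upper_to_utf8(const char *in)
{
    size_t len = strlen(in);
    char *out = malloc(len * 2 + 1);   /* each input byte yields at most 2 */
    char *p = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            *p++ = (char)c;                      /* ASCII is unchanged */
        } else if (c >= 0xA0) {
            *p++ = (char)(0xC0 | (c >> 6));      /* two-byte UTF-8 lead byte */
            *p++ = (char)(0x80 | (c & 0x3F));    /* continuation byte */
        } else {
            free(out);                           /* 0x80-0x9F: unsupported */
            return NULL;
        }
    }
    *p = '\0';
    return out;
}
```

With this, byte 230 (0xE6, æ) becomes the pair 0xC3 0xA6 and byte 198 (0xC6, Æ) becomes 0xC3 0x86.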

Passing characters outside the code page causes a lossy transformation

Making a shortcut to cmd-chars.exe with arguments αβγ (these are not present in codepage 1252) gives

C:\Users\npostavs\tmp>shortcut-cmd-chars.lnk
97 223 63

Which is aß?.

Solution

You can call CommandLineToArgvW with a call to GetCommandLineW as the first argument to get the command-line arguments in an argv-style array of wide strings. This is the only portable Windows way, especially with the code page mess; Japanese characters can be passed via a Windows shortcut for example. After that, you can use WideCharToMultiByte with a code page argument of CP_UTF8 to convert each wide-character argv element to UTF-8.

Note that calling WideCharToMultiByte with an output buffer size (byte count) of 0 will allow you to determine the number of UTF-8 bytes required for the number of characters specified (or the entire wide string including the null terminator if you wish to pass -1 as the number of wide characters to simplify your code). Then you can allocate the required number of bytes using malloc et al. and call WideCharToMultiByte again with the correct number of bytes instead of 0. If this was performance-critical, a different solution would probably be best, but since this is a one-time function to get command-line arguments, I'd say any decrease in performance would be negligible.

Of course, don't forget to free all of your memory, including calling LocalFree with the pointer returned by CommandLineToArgvW as the argument.
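
The steps described in this answer could be sketched roughly as follows. The function names `get_utf8_argv` and `free_utf8_argv` are hypothetical; only `GetCommandLineW`, `CommandLineToArgvW`, `WideCharToMultiByte` and `LocalFree` are actual Windows API calls. The block is wrapped in `#ifdef _WIN32` on the assumption that the program also builds on other platforms, and depending on the toolchain you may need to link against shell32 explicitly.

```c
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */
#include <stdlib.h>

/* Hypothetical helper: free an array built by get_utf8_argv(). */
void free_utf8_argv(char **argv8, int argc)
{
    if (argv8 == NULL)
        return;
    for (int i = 0; i < argc; i++)
        free(argv8[i]);
    free(argv8);
}

/* Hypothetical helper: return the process's command-line arguments as a
 * NULL-terminated array of UTF-8 strings and store the count in *argc_out.
 * Returns NULL on failure. */
char **get_utf8_argv(int *argc_out)
{
    int argc = 0;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (wargv == NULL)
        return NULL;

    /* calloc leaves every slot NULL, including the terminating one */
    char **argv8 = calloc((size_t)argc + 1, sizeof *argv8);
    for (int i = 0; argv8 != NULL && i < argc; i++) {
        /* First call: an output size of 0 asks for the required byte count.
         * Passing -1 as the input length means "NUL-terminated", so the
         * count includes the terminator. */
        int n = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                    NULL, 0, NULL, NULL);
        if (n > 0)
            argv8[i] = malloc((size_t)n);
        /* Second call: perform the conversion into the sized buffer. */
        if (n <= 0 || argv8[i] == NULL ||
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                argv8[i], n, NULL, NULL) == 0) {
            free_utf8_argv(argv8, argc);
            argv8 = NULL;
        }
    }

    LocalFree(wargv);   /* the wide array is released with LocalFree */
    if (argv8 != NULL)
        *argc_out = argc;
    return argv8;
}
#endif
```

A program would call `get_utf8_argv(&argc)` near the top of main and use the result in place of the argv it was given, then call `free_utf8_argv` before exiting. Error handling here is all-or-nothing for brevity.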

For more information on these functions and how to use them, see the MSDN documentation.
