Reading UTF-8 from stdin using fgets() on Windows


Question

I'm trying to read a UTF-8 string from stdin using fgets(). The console input code page has already been set to CP_UTF8. I've also set the console font to Lucida Console in PowerShell. Finally, I've verified that UTF-8 output works by printing a German Ä (in UTF-8: 0xC3, 0x84) to the console using printf(). That works correctly, but fgets() doesn't seem to be able to read UTF-8 from the console. Here is a small test program:

#include <stdio.h>
#include <string.h>   // memset
#include <windows.h>

int main(int argc, char *argv[])
{
    unsigned char s[64];

    memset(s, 0, sizeof s);

    SetConsoleOutputCP(CP_UTF8);    // code page for console output
    SetConsoleCP(CP_UTF8);          // code page for console input

    printf("UTF-8 Test: %c%c\n", 0xc3, 0x84);  // print Ä

    fgets((char *)s, sizeof s, stdin);

    printf("Result: %d %d\n", s[0], s[1]);

    return 0;
}

When I run this program, enter "Ä", and hit ENTER, it just prints the following:

Result: 0 0

That is, nothing has been written to s. When I type "A", however, I get the following correct result:

Result: 65 10

So how can I make fgets() work with UTF-8 characters on Windows?

EDIT

Based on Barmak's explanations, I've now updated my code to use the wchar_t functions instead of the ANSI ones. However, it still doesn't work. Here is my code:

#include <stdio.h>
#include <string.h>   // memset
#include <io.h>
#include <fcntl.h>

#include <windows.h>

int main(int argc, char *argv[])
{
    wchar_t s[64];

    memset(s, 0, sizeof s);

    _setmode(_fileno(stdin), _O_U16TEXT);   // treat stdin as UTF-16 text
    fgetws(s, 64, stdin);

    wprintf(L"Result: %d\n", s[0]);

    return 0;
}

When I enter A, the program prints Result: 3393, but I'd expect 65. When I enter Ä, the program prints Result: 0, but I'd expect 196. What is going on here? Why doesn't it work even for ASCII characters now? My old program, which used plain fgets(), worked correctly for ASCII characters like A and failed only for non-ASCII characters like Ä. The new version doesn't even work for ASCII characters. Or is 3393 the correct result for A? I'd expect it to be 65. I'm pretty confused now, so any help is appreciated.
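One possible reading of the 3393 value, offered as an assumption rather than something the source confirms: 3393 is 0x0D41, which is what you get when the byte for 'A' (0x41) and the carriage return that follows it (0x0D) are taken together as a single little-endian UTF-16 code unit. That would mean the input reached the program as single bytes while stdin was being decoded as UTF-16. The helper names below are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two consecutive bytes into one UTF-16 code unit, little-endian:
   the first byte is the low half, the second byte is the high half. */
uint16_t utf16le_unit(uint8_t first, uint8_t second)
{
    return (uint16_t)(first | (second << 8));
}

/* Self-check: 'A' (0x41) followed by CR (0x0D) yields 0x0D41 == 3393. */
void check_3393(void)
{
    assert(utf16le_unit(0x41, 0x0D) == 0x0D41);
    assert(utf16le_unit('A', '\r') == 3393);
}
```

If this reading is right, 3393 is not a correct result for A; it is two input bytes fused into one wide character.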

Recommended answer

Almost all native Windows string handling (with very rare exceptions) is done in Unicode (UTF-16), so you should use the Unicode functions everywhere. Using the ANSI variants is very bad practice. If you use the Unicode functions in your example, everything works correctly. With the ANSI functions it does not work because of a Windows bug! Here are the details (researched on Windows 8.1):

1) The console server process has two global variables:

UINT gInputCodePage, gOutputCodePage;

They can be read and written with GetConsoleCP/SetConsoleCP and GetConsoleOutputCP/SetConsoleOutputCP. They are used as the first argument to WideCharToMultiByte/MultiByteToWideChar when a conversion is needed. If you use only the Unicode functions, they are never used.

2.a) When you write Unicode text to the console, it is written as is, without any conversion. On the server side this is done in the SB_DoSrvWriteConsole function (the original answer showed a picture here).

2.b) When you write ANSI text to the console, SB_DoSrvWriteConsole is also called, but with one additional step: MultiByteToWideChar(gOutputCodePage, ...) converts your text to Unicode first. There is one subtlety here (the original answer showed a second picture): the MultiByteToWideChar call is made with cchWideChar == cbMultiByte. If we use only the 'English' character set (chars < 0x80), the Unicode and multibyte strings always have the same length in chars. With other languages the multibyte version usually uses more chars than the Unicode one, but that is not a problem here: the output buffer is simply larger than needed, which is fine. So your printf will in general work correctly. One note only: if you hard-code a multibyte string in the source code, it will most likely be in CP_ACP form, and converting it to Unicode with CP_UTF8 gives an incorrect result. So this depends on the encoding in which your source file is saved on disk :)
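The note about hard-coded strings can be sidestepped by spelling the bytes out, much as the question's test program does with 0xc3 and 0x84. The sketch below is only an illustration; the names A_UMLAUT_UTF8 and a_umlaut_len are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Ä" written as explicit UTF-8 bytes, so the literal does not depend on
   the encoding in which the source file is saved. */
static const char A_UMLAUT_UTF8[] = "\xC3\x84";

/* Two data bytes plus the terminating NUL. */
static_assert(sizeof(A_UMLAUT_UTF8) == 3, "Ä is two bytes in UTF-8");

/* Length in bytes, not in characters. */
size_t a_umlaut_len(void)
{
    return strlen(A_UMLAUT_UTF8);
}
```

A literal typed directly as "Ä" would instead hold whatever bytes the editor and compiler agreed on, which is the situation the note warns about.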

3.a) When you read from the console with the Unicode functions, you get the Unicode text exactly as is. There is no problem here. If needed, you can then convert it to multibyte yourself.
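A minimal sketch of what point 3.a describes, assuming a real console is attached to stdin: read UTF-16 with ReadConsoleW, then convert to UTF-8 yourself with WideCharToMultiByte. The function names read_console_line_utf8 and strip_line_ending, and the 256-character buffer, are choices made for this illustration and do not come from the answer. The Windows-only part is guarded so the helper still compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Return the length of buf (n wide chars) without trailing CR/LF.
   In line-input mode the console returns the line including "\r\n". */
size_t strip_line_ending(const wchar_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    while (n > 0 && (buf[n - 1] == L'\n' || buf[n - 1] == L'\r'))
        n--;
    return n;
}

#ifdef _WIN32
#include <windows.h>

/* Read one console line as UTF-16 and store it in out as NUL-terminated
   UTF-8. Returns the number of UTF-8 bytes written, or -1 on failure. */
int read_console_line_utf8(char *out, int out_size)
{
    wchar_t wbuf[256];
    DWORD nread = 0;
    HANDLE h = GetStdHandle(STD_INPUT_HANDLE);
    int wlen, need;

    if (out == NULL || out_size <= 0)
        return -1;
    if (!ReadConsoleW(h, wbuf, 256, &nread, NULL))
        return -1;                      /* e.g. stdin is redirected */

    wlen = (int)strip_line_ending(wbuf, nread);
    if (wlen == 0) {                    /* empty line: nothing to convert */
        out[0] = '\0';
        return 0;
    }

    /* First call asks how many UTF-8 bytes are needed. */
    need = WideCharToMultiByte(CP_UTF8, 0, wbuf, wlen, NULL, 0, NULL, NULL);
    if (need <= 0 || need >= out_size)
        return -1;

    /* Second call performs the conversion; no NUL is added for us. */
    WideCharToMultiByte(CP_UTF8, 0, wbuf, wlen, out, out_size, NULL, NULL);
    out[need] = '\0';
    return need;
}
#endif
```

Asking WideCharToMultiByte for the required size before converting avoids assuming that the byte count equals the character count. ReadConsoleW fails when stdin is redirected from a file or pipe, so a real program would need a fallback for that case.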

3.b) When you read from the console with the ANSI functions, the server first converts the Unicode string to ANSI and then returns the ANSI form to you. This is done by the following function:

int ConvertToOem(UINT CodePage /*=gInputCodePage*/, PCWSTR lpWideCharStr, int cchWideChar, PSTR lpMultiByteStr, int cbMultiByte)
{
    if (CodePage == g_OEMCP)
    {
        ULONG BytesInOemString;
        return 0 > RtlUnicodeToOemN(lpMultiByteStr, cbMultiByte, &BytesInOemString, lpWideCharStr, cchWideChar * sizeof(WCHAR)) ? 0 : BytesInOemString;
    }
    return WideCharToMultiByte(CodePage, 0, lpWideCharStr, cchWideChar, lpMultiByteStr, cbMultiByte, 0, 0);
}

But let's look more closely at how ConvertToOem is called (the original answer showed a picture here): again cbMultiByte == cchWideChar, and here that is 100% a bug! A multibyte string can be longer than the Unicode one (in chars, of course). For example, "Ä" is 1 Unicode char but 2 UTF-8 chars. As a result, WideCharToMultiByte returns 0 (ERROR_INSUFFICIENT_BUFFER).
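The size mismatch can be checked without any Windows API. The sketch below counts how many UTF-8 bytes a run of UTF-16 code units needs; utf8_bytes_needed is a hypothetical helper written for this illustration, and counting an unpaired surrogate as 3 bytes (as if it were replaced by U+FFFD) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed to encode n UTF-16 code units. */
size_t utf8_bytes_needed(const uint16_t *s, size_t n)
{
    size_t bytes = 0;
    size_t i;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        uint16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;                 /* ASCII */
        } else if (u < 0x800) {
            bytes += 2;                 /* e.g. U+00C4 "Ä" */
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                 /* surrogate pair */
            i++;                        /* skip the low surrogate */
        } else {
            bytes += 3;                 /* rest of the BMP */
        }
    }
    return bytes;
}
```

For "Ä" (one code unit, 0x00C4) the function returns 2, so an output buffer sized at one byte per wide character is too small, which matches the failure described above. For "A" it returns 1, which is why plain ASCII input still worked.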
