Is it possible to "force" UTF-8 in a C program?


Question

Usually when I want my program to use UTF-8 encoding, I write setlocale(LC_ALL, "");. But today I found that this just sets the locale to the environment's default locale, and I cannot know whether the environment uses UTF-8 by default.

Is there any way to force the character encoding to be UTF-8? Also, is there any way to check whether my program is using UTF-8?

Solution

It is possible, but it is the completely wrong thing to do.

First of all, the current locale is for the user to decide. It is not just the character set, but also the language, date and time formats, and so on. Your program has absolutely no "right" to mess with it.

If you cannot localize your program, just tell the user the environmental requirements your program has, and let them worry about it.
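If you want to check that requirement at run time instead of overriding anything, POSIX systems let you ask which character encoding the active locale uses via nl_langinfo(CODESET). The following is only a minimal sketch: the helper names codeset_is_utf8() and locale_is_utf8() are made up for this example, and the spelling of the codeset name ("UTF-8", "utf8", and so on) varies between C libraries, so the comparison is an assumption rather than a guarantee.

```c
#include <langinfo.h>   /* nl_langinfo(), CODESET (POSIX, not ISO C) */
#include <stddef.h>
#include <strings.h>    /* strcasecmp() (POSIX) */

/* Hypothetical helper: does this codeset name denote UTF-8?
 * Assumption: C libraries spell it "UTF-8" or "UTF8", in either case. */
int codeset_is_utf8(const char *name)
{
    if (name == NULL)
        return 0;
    return strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0;
}

/* Hypothetical helper: is the currently active locale using UTF-8?
 * Call setlocale(LC_ALL, "") first; otherwise the program is still
 * in the default "C" locale and this says nothing about the user. */
int locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

In main(), after setlocale(LC_ALL, ""), a program could call locale_is_utf8() and print a warning to standard error if it returns 0, leaving it to the user to fix their environment.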

You should not rely on UTF-8 being the current encoding at all. Use wide character support instead, including functions like wctype(), mbstowcs(), and so on. POSIXy systems also provide the iconv_open() and iconv() family of functions in their C libraries to convert between encodings (which should always include conversion to and from wchar_t); on Windows, you need a separate libiconv library. This is how, for example, the GCC compiler handles different character sets. (Internally, it uses Unicode/UTF-8, but if you ask it to, it can do the necessary conversions to work with other character sets.)
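As a small illustration of the wide character route, here is a sketch that converts a narrow (multibyte) string, in whatever encoding the current locale uses, into a wide string with mbstowcs(). The function name widen() is made up for this example, and calling mbstowcs() with a NULL destination to count the characters is behaviour specified by POSIX rather than by ISO C, so take that part as an assumption about your platform.

```c
#include <errno.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string in the current
 * locale's encoding to a newly allocated wide string.
 * Returns NULL with errno set on failure; the caller must free()
 * the result. */
wchar_t *widen(const char *src)
{
    size_t   len;
    wchar_t *dst;

    if (src == NULL) {
        errno = EINVAL;
        return NULL;
    }

    /* With a NULL destination, mbstowcs() only counts the wide
     * characters needed, not including the terminating L'\0'
     * (POSIX behaviour). */
    len = mbstowcs(NULL, src, 0);
    if (len == (size_t)-1) {
        errno = EILSEQ;     /* Invalid sequence for this locale. */
        return NULL;
    }

    dst = malloc((len + 1) * sizeof dst[0]);
    if (dst == NULL) {
        errno = ENOMEM;
        return NULL;
    }

    /* Second pass does the actual conversion, including L'\0'. */
    if (mbstowcs(dst, src, len + 1) == (size_t)-1) {
        free(dst);
        errno = EILSEQ;
        return NULL;
    }

    return dst;
}
```

Because the conversion follows the current locale, the same code works whether the user's environment is UTF-8, ISO-8859-15, or something else, provided setlocale(LC_ALL, "") was called first.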

I am personally a strong proponent of using UTF-8 everywhere, but overriding the user locale in a program is horrific. Abominable. Distasteful; like a desktop applet changing the display resolution because the programmer is particularly fond of a certain one.

I would be happy to write some example code to show how to correctly handle any character-set-sensitive situation, but there are so many that I don't know where to start.

If the OP amends their question to state exactly what problem overriding the character set is supposed to solve, I'm willing to show how to use the aforementioned utilities and POSIX facilities (or equivalent freely available libraries on Windows) to solve it correctly.

If this seems harsh to someone, it is, but only because taking the easy and simple route here (overriding the user's locale setting) is so ... wrong, purely on technical grounds. Even taking no action is better, and actually quite acceptable, as long as you document that your application only handles UTF-8 input/output.


Example 1. Localized Happy New Year!

#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* We wish to use the user's current locale. */
    setlocale(LC_ALL, "");

    /* We intend to use wide functions on standard output. */
    fwide(stdout, 1);

    /* For Windows compatibility, print out a Byte Order Mark.
     * If you save the output to a file, this helps tell Windows
     * applications that the file is Unicode.
     * Other systems don't need it nor use it.
    */
    fputwc(L'\uFEFF', stdout);

    wprintf(L"Happy New Year!\n");
    wprintf(L"С новым годом!\n");
    wprintf(L"新年好!\n");
    wprintf(L"賀正!\n");
    wprintf(L"¡Feliz año nuevo!\n");
    wprintf(L"Hyvää uutta vuotta!\n");

    return EXIT_SUCCESS;
}

Note that wprintf() takes a wide string (wide string constants are of form L"", wide character constants L'', as opposed to normal/narrow counterparts "" and ''). Formats are still the same; %s prints a normal/narrow string, and %ls a wide string.
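To make the %s versus %ls distinction concrete, here is a sketch using swprintf(), which formats into a wide buffer the same way wprintf() formats to a wide stream. The function name describe() and the message text are just placeholders for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: format one narrow and one wide string into buf.
 * The format string itself is wide (L"..."); %s consumes a char *
 * and %ls consumes a wchar_t *. Returns the number of wide characters
 * written (not counting L'\0'), or a negative value if the result
 * does not fit in n wide characters or a conversion fails. */
int describe(wchar_t *buf, size_t n, const char *narrow, const wchar_t *wide)
{
    return swprintf(buf, n, L"narrow=%s wide=%ls", narrow, wide);
}
```

The narrow argument is converted using the current locale's encoding, which is one more reason to call setlocale(LC_ALL, "") before doing any wide output.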


Example 2. Reading input lines from standard input, and optionally saving them to a file. The file name is supplied on the command line.

#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wctype.h>
#include <wchar.h>
#include <errno.h>
#include <stdio.h>

typedef enum {
    TRIM_LEFT     = 1,      /* Remove leading whitespace and control characters */
    TRIM_RIGHT    = 2,      /* Remove trailing whitespace and control characters */
    TRIM_NEWLINE  = 4,      /* Remove newline at end of line */
    TRIM          = 7,      /* Remove leading and trailing whitespace and control characters */
    OMIT_NUL      = 8,      /* Skip NUL characters (embedded binary zeros, L'\0') */
    OMIT_CONTROLS = 16,     /* Skip control characters */
    CLEANUP       = 31,     /* All of the above. */
    COMBINE_LWS   = 32,     /* Combine all whitespace into a single space */
} trim_opts;


/* Read an unlimited-length line from a wide input stream.
 *
 * This function takes a pointer to a wide string pointer,
 * pointer to the number of wide characters dynamically allocated for it,
 * the stream to read from, and a set of options on how to treat the line.
 *
 * If an error occurs, this will return 0 with errno set to nonzero error number.
 * Use strerror(errno) to obtain the error description (as a narrow string).
 *
 * If there is no more data to read from the stream,
 * this will return 0 with errno 0, and feof(stream) will return true.
 *
 * If an empty line is read,
 * this will return 0 with errno 0, but feof(stream) will return false.
 *
 * Typically, you initialize variables like
 *      wchar_t *line = NULL;
 *      size_t   size = 0;
 * before calling this function, so that subsequent calls reuse the same
 * dynamically allocated buffer for the line, and it is automatically grown
 * if necessary. There are no built-in limits to line lengths this way.
*/
size_t getwline(wchar_t **const lineptr,
                size_t   *const sizeptr,
                FILE     *const in,
                trim_opts const trimming)
{
    wchar_t *line;
    size_t   size;
    size_t   used = 0;
    wint_t   wc;
    fpos_t   startpos;
    int      seekable;

    if (lineptr == NULL || sizeptr == NULL || in == NULL) {
        errno = EINVAL;
        return 0;
    }

    if (*lineptr != NULL) {
        line = *lineptr;
        size = *sizeptr;
    } else {
        line = NULL;
        size = 0;
        *sizeptr = 0;
    }

    /* In error cases, we can try and get back to this position
     * in the input stream, as we cannot really return the data
     * read thus far. However, some streams like pipes are not seekable,
     * so in those cases we should not even try.
     * Use (seekable) as a flag to remember if we should try.
    */
    if (fgetpos(in, &startpos) == 0)
        seekable = 1;
    else
        seekable = 0;

    while (1) {

        /* When we read a wide character from a wide stream,
         * fgetwc() will return WEOF with errno set if an error occurs.
         * However, fgetwc() will return WEOF with errno *unchanged*
         * if there is no more input in the stream.
         * To detect which of the two happened, we need to clear errno
         * first.
        */
        errno = 0;
        wc = fgetwc(in);
        if (wc == WEOF) {
            if (errno) {
                const int saved_errno = errno;
                if (seekable)
                    fsetpos(in, &startpos);
                errno = saved_errno;
                return 0;
            }
            if (ferror(in)) {
                if (seekable)
                    fsetpos(in, &startpos);
                errno = EIO;
                return 0;
            }
            break;
        }

        /* Dynamically grow line buffer if necessary.
         * We need room for the current wide character,
         * plus at least the end-of-string mark, L'\0'.
        */
        if (used + 2 > size) {
            /* Size policy. This can be anything you see fit,
             * as long as it yields size >= used + 2.
             *
             * This one increments size to next multiple of
             * 1024 (minus 16). It works well in practice,
             * but do not think of it as the "best" way.
             * It is just a robust choice.
            */
            size = (used | 1023) + 1009;
            line = realloc(line, size * sizeof line[0]);
            if (!line) {
                /* Memory allocation failed. */
                if (seekable)
                    fsetpos(in, &startpos);
                errno = ENOMEM;
                return 0;
            }
            *lineptr = line;
            *sizeptr = size;
        }

        /* Append character to buffer. */
        if (!trimming)
            line[used++] = wc;
        else {
            /* Check if we have reasons to NOT add the character to buffer. */
            do {
                /* Omit NUL if asked to. */
                if (trimming & OMIT_NUL)
                    if (wc == L'\0')
                        break;

                /* Omit controls if asked to. */
                if (trimming & OMIT_CONTROLS)
                    if (iswcntrl(wc))
                        break;

                /* If we are at start of line, and we are left-trimming,
                 * only graphs (printable non-whitespace characters) are added. */
                if ((trimming & TRIM_LEFT) && used == 0)
                    if (wc == L'\0' || !iswgraph(wc))
                        break;

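                /* Combine whitespaces if asked to. Store the space directly
                 * instead of overwriting wc, so that the end-of-line test
                 * below still sees the original L'\n'. */
                if (trimming & COMBINE_LWS)
                    if (iswspace(wc)) {
                        if (used == 0 || line[used-1] != L' ')
                            line[used++] = L' ';
                        break;
                    }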

                /* Okay, add the character to buffer. */
                line[used++] = wc;

            } while (0);
        }

        /* End of the line? */
        if (wc == L'\n')
            break;
    }

    /* The above loop will only end (break out)
     * if end of line or end of input was found,
     * and no error occurred.
    */

    /* Trim right if asked to. */
    if (trimming & TRIM_RIGHT)
        while (used > 0 && iswspace(line[used-1]))
            --used;
    else
    if (trimming & TRIM_NEWLINE)
        while (used > 0 && (line[used-1] == L'\r' || line[used-1] == L'\n'))
            --used;

    /* Ensure we have room for end-of-string L'\0'. */
    if (used >= size) {
        size = used + 1;
        line = realloc(line, size * sizeof line[0]);
        if (!line) {
            if (seekable)
                fsetpos(in, &startpos);
            errno = ENOMEM;
            return 0;
        }
        *lineptr = line;
        *sizeptr = size;
    }

    /* Add end of string mark. */
    line[used] = L'\0';

    /* Successful return. */
    errno = 0;
    return used;
}

/* Counts the number of wide characters in 'alpha' class.
*/
size_t count_letters(const wchar_t *ws)
{
    size_t count = 0;
    if (ws)
        while (*ws != L'\0')
            if (iswalpha(*(ws++)))
                count++;
    return count;
}

int main(int argc, char *argv[])
{
    FILE    *out;

    wchar_t *line = NULL;
    size_t   size = 0;
    size_t   len;

    setlocale(LC_ALL, "");

    /* Standard input and output should use wide characters. */
    fwide(stdin, 1);
    fwide(stdout, 1);

    /* Check if the user asked for help. */
    if (argc < 2 || argc > 3 || strcmp(argv[1], "-h") == 0 || strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "/?") == 0) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv[0]);
        fprintf(stderr, "       %s FILENAME [ PROMPT ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "The program will read input lines until a line containing only '.' is supplied.\n");
        fprintf(stderr, "If you do not want to save the output to a file,\n");
        fprintf(stderr, "use '-' as the FILENAME.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    /* Open file for output, unless it is "-". */
    if (strcmp(argv[1], "-") == 0)
        out = NULL; /* No output to file */
    else {
        out = fopen(argv[1], "w");
        if (out == NULL) {
            fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
            return EXIT_FAILURE;
        }

        /* The output file is used with wide strings. */
        fwide(out, 1);
    }

    while (1) {

        /* Prompt? Note: our prompt string is narrow, but stdout is wide. */
        if (argc > 2) {
            wprintf(L"%s\n", argv[2]);
            fflush(stdout);
        }

        len = getwline(&line, &size, stdin, CLEANUP);
        if (len == 0) {
            if (errno) {
                fprintf(stderr, "Error reading standard input: %s.\n", strerror(errno));
                break;
            }
            if (feof(stdin))
                break;
        }

        /* The user does not wish to supply more lines? */
        if (wcscmp(line, L".") == 0)
            break;

        /* Print the line to the file. */
        if (out != NULL) {
            fputws(line, out);
            fputwc(L'\n', out);
        }

        /* Tell the user what we read. */
        wprintf(L"Received %lu wide characters, %lu of which were letterlike.\n",
                (unsigned long)len, (unsigned long)count_letters(line));
        fflush(stdout);
    }

    /* Close the output file, if any. Buffered data is written out
     * at this point, so a failure here means the file is incomplete. */
    if (out != NULL && fclose(out) != 0) {
        fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
        free(line);
        return EXIT_FAILURE;
    }

    /* The line buffer is no longer needed, so we can discard it.
     * Note that free(NULL) is safe, so we do not need to check.
    */
    free(line);

    /* I personally also like to reset the variables.
     * It helps with debugging, and to avoid reuse-after-free() errors. */
    line = NULL;
    size = 0;

    return EXIT_SUCCESS;
}

The getwline() function above is about as complicated as functions get when you deal with localized wide character support. It lets you read localized input lines without length restrictions, and optionally trims and cleans up the returned string (removing control codes and embedded binary zeros). It also works fine with both LF and CR-LF (\n and \r\n) newline encodings.
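The newline handling is the easiest part to lift out on its own. As a sketch (wchomp() is a name made up for this example, and it modifies the string in place), stripping a trailing LF or CR-LF from a wide string that has already been read could look like this:

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: remove any trailing L'\r' and L'\n' characters
 * from a wide string, in place. Returns the new length. */
size_t wchomp(wchar_t *ws)
{
    size_t len;

    if (ws == NULL)
        return 0;

    len = wcslen(ws);
    while (len > 0 && (ws[len-1] == L'\r' || ws[len-1] == L'\n'))
        len--;

    ws[len] = L'\0';
    return len;
}
```

This mirrors what the TRIM_NEWLINE branch in getwline() does, so files written on Windows and on Unix-like systems end up looking the same to the rest of the program.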
