How to use UTF-8 in C code?
Question
My setup: gcc-4.9.2, UTF-8 environment.
The following C program works with ASCII input but not with UTF-8.
Create the input file:
echo -n 'привет мир' > /tmp/вход
This is test.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 10

int main(void)
{
    char buf[SIZE+1];
    char *pat = "привет мир";
    char str[SIZE+2];
    FILE *f1;
    FILE *f2;

    f1 = fopen("/tmp/вход","r");
    f2 = fopen("/tmp/выход","w");
    if (fread(buf, 1, SIZE, f1) > 0) {
        buf[SIZE] = 0;
        if (strncmp(buf, pat, SIZE) == 0) {
            sprintf(str, "% 11s\n", buf);
            fwrite(str, 1, SIZE+2, f2);
        }
    }
    fclose(f1);
    fclose(f2);
    exit(0);
}
Check the result:
./test; grep -q ' привет мир' /tmp/выход && echo OK
What should be done to make UTF-8 code work as if it were ASCII code, without having to care how many bytes a symbol takes? In other words: what should change in the example so that any UTF-8 symbol is treated as a single unit (including argv, STDIN, STDOUT, STDERR, file input, output and the program code)?
Recommended answer
#define SIZE 10
A buffer size of 10 is insufficient to store the UTF-8 string привет мир: it has 10 characters, but each of its nine Cyrillic letters takes two bytes, so together with the space it occupies 19 bytes. Try changing SIZE to a larger value. On my system (Ubuntu 12.04, gcc 4.8.1), changing it to 20 worked perfectly.
UTF-8 is a multibyte encoding that uses between 1 and 4 bytes per character, so 40 (10 characters at up to 4 bytes each) is a safer buffer size above. There is a long discussion at "How many bytes does one Unicode character take?" that might be interesting.