How to use UTF-8 in C code?

Question

My setup: gcc-4.9.2, UTF-8 environment.

The following C-program works in ASCII, but does not in UTF-8.

Create the input file:

echo -n 'привет мир' > /tmp/вход

This is test.c:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 10

int main(void)
{
  char buf[SIZE+1];
  char *pat = "привет мир";
  char str[SIZE+2];

  FILE *f1;
  FILE *f2;

  f1 = fopen("/tmp/вход","r");
  f2 = fopen("/tmp/выход","w");

  if (fread(buf, 1, SIZE, f1) > 0) {
    buf[SIZE] = 0;

    if (strncmp(buf, pat, SIZE) == 0) {
      sprintf(str, "% 11s\n", buf);
      fwrite(str, 1, SIZE+2, f2);
    }
  }

  fclose(f1);
  fclose(f2);

  exit(0);
}

Check the result:

./test; grep -q ' привет мир' /tmp/выход && echo OK

What should be done to make UTF-8 code work as if it were ASCII code, without having to care how many bytes a symbol takes? In other words: what has to change in the example so that any UTF-8 symbol is treated as a single unit (this includes argv, STDIN, STDOUT, STDERR, file input, file output and the program code itself)?

Recommended answer

#define SIZE 10

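The buffer size of 10 is insufficient to store the UTF-8 string привет мир. The string has 10 characters, but each Cyrillic letter takes 2 bytes in UTF-8, so it occupies 19 bytes (9 letters at 2 bytes each plus 1 byte for the space). Try changing SIZE to a larger value. On my system (Ubuntu 12.04, gcc 4.8.1), changing it to 20 worked perfectly.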
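To make the gap between bytes and characters concrete, here is a minimal sketch that counts the characters (code points) in a UTF-8 string. The function name utf8_char_count is hypothetical and not part of the original question or answer. It assumes the input is valid UTF-8 and counts code points, not user-perceived characters, so a letter followed by a combining accent counts as two.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the code points in a NUL-terminated string,
 * assuming it holds valid UTF-8. Continuation bytes have the bit pattern
 * 10xxxxxx, so every byte that does NOT match it starts a new character.
 */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string from the question, strlen reports the number of bytes, while this helper reports the number of characters, which is why SIZE 10 matches what you see on screen but not what fread has to store.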

UTF-8 is a multibyte encoding that uses between 1 and 4 bytes per character, so a 10-character string can need up to 40 bytes. It is therefore safer to use 40 as the buffer size above. The discussion under "How many bytes does one Unicode character take?" may also be of interest.
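The worst-case sizing rule can be sketched as follows. The macro UTF8_BUF_SIZE and the function utf8_prefix_bytes are hypothetical names introduced for illustration, and the code assumes valid UTF-8 input. The macro reserves 4 bytes per character plus a terminating NUL. The function reports how many bytes the first nchars characters occupy, which is the length you would pass to byte-oriented calls such as strncmp or fwrite in place of a character count.

```c
#include <assert.h>
#include <stddef.h>

/* Worst case: 4 bytes per character, plus one byte for the terminating NUL. */
#define UTF8_MAX_BYTES_PER_CHAR 4
#define UTF8_BUF_SIZE(nchars) ((nchars) * UTF8_MAX_BYTES_PER_CHAR + 1)

/*
 * Hypothetical helper: return the number of bytes taken up by the first
 * nchars code points of s (or by the whole string if it is shorter).
 * Assumes s is valid, NUL-terminated UTF-8.
 */
size_t utf8_prefix_bytes(const char *s, size_t nchars)
{
    assert(s != NULL);

    size_t bytes = 0;
    size_t chars = 0;
    while (s[bytes] != '\0') {
        /* Any byte that is not a continuation byte (10xxxxxx) starts a character. */
        if (((unsigned char)s[bytes] & 0xC0) != 0x80) {
            if (chars == nchars)
                break;          /* the next character is beyond the prefix */
            chars++;
        }
        bytes++;
    }
    return bytes;
}
```

With these, a buffer declared as char buf[UTF8_BUF_SIZE(10)] is large enough for any 10-character string, and utf8_prefix_bytes(pat, 10) gives 19 for the pattern in the question.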
