Reading non-ASCII characters in C


Problem description

I am parsing a file that contains characters such as æ, ø and å. Suppose I have read one line of the text file as follows:

#define MAXLINESIZE 1024
char *buffer = malloc(MAXLINESIZE);
...
fgets(buffer, MAXLINESIZE, handle);
...

Now suppose I want to count the number of characters on a line. If I try the following:

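char *p = buffer;
int count = 0;
/* stop at the newline, or at the terminator if the line has none */
while (*p != '\0' && *p != '\n') {
    /* cast: passing a negative char to isgraph is undefined behavior */
    if (isgraph((unsigned char)*p)) {
        count++;
    }
    p++;
}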

This ignores æ, ø and å, i.e. counting "aåeæioøu" returns 5, not 8.

Do I need to read the file in a different way? Should I be using an int* rather than a char*?

Recommended answer

You need to know which encoding your characters use. It is very probably UTF-8 (and you should use UTF-8 everywhere); read Joel's blog post on Unicode. If your encoding is not UTF-8, you should convert it to UTF-8, e.g. using libiconv.
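As a rough sketch of that conversion step, the function below uses the POSIX iconv interface (iconv_open, iconv, iconv_close), which libiconv also provides. The source encoding ISO-8859-1 is only an assumption for illustration, since the question does not say what the file actually uses, and latin1_to_utf8 is a hypothetical helper name. On some platforms you may need to link with -liconv.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string into a
   newly allocated UTF-8 string. Returns NULL on failure; the caller frees
   the result. The source encoding is an assumption, adjust it as needed. */
char *latin1_to_utf8(const char *in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  /* (to, from) */
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(in);
    /* Each ISO-8859-1 byte becomes at most 2 UTF-8 bytes. */
    size_t out_size = 2 * in_left + 1;
    size_t out_left = out_size - 1;
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)in;  /* iconv wants char**, input is not modified */
    char *out_ptr = out;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {     /* invalid input or output buffer too small */
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

Each line returned by fgets could be passed through such a helper before any further processing, so that the rest of the program only ever deals with UTF-8.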

Then you need a C library for UTF-8. There are many of them (but none is standardized in the C11 language yet). I recommend libunistring or glib (from GTK), but see also this.

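Your code will change, since a UTF-8 character takes one to four 8-bit bytes (the Wikipedia UTF-8 page mentions up to 6 bytes, which was the original design; RFC 3629 restricts it to four; see the Unicode standard for details). If the file is UTF-8, each of æ, ø and å is stored as two bytes with the high bit set, and isgraph rejects those bytes in the default "C" locale, which is why the count comes out as 5. You no longer test whether a single byte (i.e. a plain C char) is a letter, but whether a byte and the few bytes after it (reached through a pointer, i.e. a char* or better a uint8_t*) encode a letter (including Cyrillic letters, etc.).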
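For the counting problem in the question, a full Unicode library is not strictly needed. Here is a minimal sketch, assuming the line is valid UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so counting every byte that is not a continuation byte counts code points. The name count_utf8_chars is hypothetical, and like the original loop it skips ASCII spaces and control characters.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the visible characters (code points) on one line of UTF-8 text,
   stopping at '\n' or at the terminating NUL.
   Assumes the input is valid UTF-8. */
size_t count_utf8_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    for (; *p != '\0' && *p != '\n'; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;           /* continuation byte: part of a previous char */
        if (*p < 0x80 && !isgraph(*p))
            continue;           /* ASCII space or control character */
        count++;                /* ASCII graphic char or a multibyte lead byte */
    }
    return count;
}
```

With the UTF-8 bytes of "aåeæioøu" followed by a newline, this returns 8. It counts code points, not what a user perceives as characters: a letter written as a base character plus a combining mark counts as two, and non-ASCII whitespace is counted as well. Handling those cases is where a library such as libunistring or glib becomes useful.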

Not every sequence of bytes is a valid UTF-8 representation, and you might want to validate a line (or a null-terminated C string) before analyzing it.
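A validation pass could look roughly like the sketch below. It is hand-written for illustration rather than taken from any particular library (is_valid_utf8 is a hypothetical name), and it follows the RFC 3629 rules: correct lead and continuation byte patterns, no overlong encodings, no surrogates, and nothing above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the NUL-terminated string s is well-formed UTF-8. */
bool is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        unsigned char b = *p++;
        int extra;          /* number of continuation bytes expected */
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80) {
            continue;                                   /* plain ASCII */
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;       /* 110xxxxx */
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;      /* 1110xxxx */
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;    /* 11110xxx */
        } else {
            return false;   /* stray continuation byte or invalid lead byte */
        }

        while (extra-- > 0) {
            /* A NUL here fails the test too, so we never read past the end. */
            if ((*p & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);
        }

        if (cp < min)
            return false;   /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;   /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* UTF-16 surrogate, not allowed in UTF-8 */
    }
    return true;
}
```

Running such a check on each line right after fgets lets later code, such as a character counter, assume well-formed input.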
