How to compare multibyte characters in C
Problem description
I am trying to parse text and find certain characters in it, using the code below. It works with plain characters such as abcdef, but it does not work with öçşğüı, and GCC emits compilation warnings. What should I do to make it work with öçşğüı?
Code:
#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main()
{
    char * text = "öçşğü";
    int i=0;
    text = strdup(text);
    while (text[i])
    {
        if(text[i] == 'ö')
        {
            printf("ö\n");
        }
        i++;
    }
    return 0;
}
Warnings:
warning: multi-character character constant [-Wmultichar]
warning: comparison is always false due to limited range of data type [-Wtype-limits]
There are 10 entries when I print each char in the while loop with %p:
printf("%d : %p\n", i, text[i]);
Output:
0 : 0xffffffc3
1 : 0xffffffb6
2 : 0xffffffc3
3 : 0xffffffa7
4 : 0xffffffc5
5 : 0xffffff9f
6 : 0xffffffc4
7 : 0xffffff9f
8 : 0xffffffc3
9 : 0xffffffbc
and strlen is 10.
But if I use abcde:
0 : 0x61
1 : 0x62
2 : 0x63
3 : 0x64
4 : 0x65
and strlen is 5.
If I use wchar_t for text, the output is:
0 : 0xa7c3b6c3
1 : 0x9fc49fc5
2 : 0xbcc3
and strlen is 10, wcslen is 3.
Recommended answer
To step through each character in the string, you can use mblen. You also need to set the correct locale (the encoding that the multibyte string is in), so that mblen can parse the multibyte string correctly.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    char * text = "öçşğü";
    int i=0, char_len;

    /* mblen depends on LC_CTYPE; setlocale returns NULL if the locale is not available */
    if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
    {
        fprintf(stderr, "locale en_US.utf8 is not available\n");
        return 1;
    }
    while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
    {
        /* &text[i] contains multibyte character of length char_len */
        if(memcmp(&text[i], "ö", char_len) == 0)
        {
            printf("ö\n");
        }
        i += char_len;
    }
    return 0;
}
There are two kinds of string representation: multibyte (sequences of 8-bit bytes) and wide character (element size depends on the platform). The multibyte representation has the advantage that it can be stored in a char * (the usual C string, as in your code), but the disadvantage that several bytes may represent a single character. A wide string is stored in a wchar_t *. wchar_t has the advantage that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still fail on platforms where wchar_t cannot represent all possible characters).
Have you looked at your source code in a hex editor? The string "öçşğü" is actually represented in memory by the multibyte string c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by a terminating zero byte. You see 5 characters only because your UTF-8-aware viewer/browser renders the string correctly. This is why strlen(text) returns 10, whereas the code above loops only 5 times.
If you use a wide-character string, it can be done as explained by @WillBriggs.