Does POSIX regex.h support Unicode, or non-ASCII characters in general?
Question
Hi, I am using the standard regex library (regcomp, regexec, ...). I now need to add Unicode support to my regular-expression code.
Does the standard regex library support Unicode, or non-ASCII characters in general? From what I found on the web, I think it does not.
My project is resource-constrained, so I don't want to pull in a large library such as ICU or Boost.Regex.
Any help would be appreciated.
Recommended answer
POSIX regex appears to work properly under a UTF-8 locale. I wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]" (for example), and everything worked fine.
Note: the main thing to remember is that the regex functions are locale-dependent, so you must call setlocale() before calling regcomp() and regexec().
#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
    int ret;
    regex_t reg;
    regmatch_t matches[10];

    if (argc != 3) {
        fprintf(stderr, "Usage: %s regex string\n", argv[0]);
        return 1;
    }

    setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

    /* Compile the pattern; flags 0 selects basic regular expressions */
    if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
        char buf[256];
        regerror(ret, &reg, buf, sizeof(buf));
        fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
        return 1;
    }

    if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
        size_t i;
        size_t size;
        char buf[256];

        /* Print each matched (sub)expression with its byte offsets */
        for (i = 0; i < sizeof(matches) / sizeof(regmatch_t); i++) {
            if (matches[i].rm_so == -1) break;
            size = (size_t)(matches[i].rm_eo - matches[i].rm_so);
            if (size >= sizeof(buf)) {
                fprintf(stderr, "match (%ld-%ld) is too long (%zu)\n",
                        (long)matches[i].rm_so, (long)matches[i].rm_eo, size);
                continue;
            }
            memcpy(buf, argv[2] + matches[i].rm_so, size);
            buf[size] = '\0';
            printf("%zu: %ld-%ld: '%s'\n", i,
                   (long)matches[i].rm_so, (long)matches[i].rm_eo, buf);
        }
    }

    regfree(&reg); /* Release memory held by the compiled pattern */
    return 0;
}
Example run of the test program:
$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$
The match spans two bytes (offsets 5-7) because each Cyrillic letter is encoded as two bytes in UTF-8; the offsets reported by regexec() are byte offsets, not character counts.