Does POSIX regex.h support Unicode or, more basically, non-ASCII characters?
Question
Hi, I am using the standard regex library (regcomp, regexec, ...). I now need to add Unicode support to my regular-expression code.
Does the standard regex library support Unicode or, more basically, non-ASCII characters? I researched this on the web and think it does not.
My project is resource-critical, so I don't want to use large libraries (ICU and Boost.Regex) for it.
Any help would be appreciated.
Recommended answer
It looks like POSIX regex works properly under a UTF-8 locale. I just wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]" (for example), and everything worked fine.
Note: the main thing to remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.
#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
    int ret;
    regex_t reg;
    regmatch_t matches[10];

    if (argc != 3) {
        fprintf(stderr, "Usage: %s regex string\n", argv[0]);
        return 1;
    }

    setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

    /* Compile the pattern as a basic regular expression (cflags = 0) */
    if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
        char buf[256];
        regerror(ret, &reg, buf, sizeof(buf));
        fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
        return 1;
    }

    if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
        size_t i;
        char buf[256];
        size_t size;

        /* Print every reported match as byte offsets plus the matched text */
        for (i = 0; i < sizeof(matches) / sizeof(matches[0]); i++) {
            if (matches[i].rm_so == -1) break;
            size = (size_t)(matches[i].rm_eo - matches[i].rm_so);
            if (size >= sizeof(buf)) {
                fprintf(stderr, "match (%ld-%ld) is too long (%zu)\n",
                        (long)matches[i].rm_so, (long)matches[i].rm_eo, size);
                continue;
            }
            /* Copy the matched bytes and terminate the string */
            memcpy(buf, argv[2] + matches[i].rm_so, size);
            buf[size] = '\0';
            printf("%zu: %ld-%ld: '%s'\n", i,
                   (long)matches[i].rm_so, (long)matches[i].rm_eo, buf);
        }
    }

    regfree(&reg);
    return 0;
}
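Because the test above depends on whichever locale setlocale() picks up from the environment, it can be useful to confirm that the active locale really uses UTF-8 before relying on the match results. The following is only a sketch: the helper names codeset_is_utf8 and use_env_locale_is_utf8 are made up for illustration, and the accepted spellings "UTF-8" and "utf8" are an assumption, since the string returned by nl_langinfo(CODESET) can vary between platforms.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the given codeset name denotes UTF-8, 0 otherwise.
   The two spellings checked here are an assumption; other platforms
   may report the codeset differently. */
int codeset_is_utf8(const char *codeset) {
    return codeset != NULL &&
           (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}

/* Hypothetical helper: switch to the environment's locale and report
   whether its character encoding is UTF-8. */
int use_env_locale_is_utf8(void) {
    if (setlocale(LC_ALL, "") == NULL)
        return 0; /* the requested locale is not available */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program could call use_env_locale_is_utf8() in place of the bare setlocale() call and print a warning when it returns 0.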
Usage example:
$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$
The match is two bytes long because each Cyrillic letter occupies two bytes in UTF-8.
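Since rm_so and rm_eo are byte offsets, a caller that needs character positions has to convert them. Here is a minimal sketch of one way to do that, assuming the input is valid UTF-8; the function name utf8_char_count is hypothetical and not part of regex.h.

```c
#include <stddef.h>

/* Count UTF-8 characters (code points) in the first nbytes bytes of s.
   Assumes s holds valid UTF-8: every byte that is not a continuation
   byte (bit pattern 10xxxxxx) starts a new character. */
size_t utf8_char_count(const char *s, size_t nbytes) {
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the match reported above (bytes 5-7 of ' 359 фыва'), utf8_char_count(s, 5) gives the character index 5 of the match start, and utf8_char_count(s + 5, 2) gives a match length of 1 character.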