Does POSIX regex.h support Unicode or, more generally, non-ASCII characters?


Question

Hi, I am using the standard regex library (regcomp, regexec, ...). I now need to add Unicode support to my regular-expression code.

Does the standard regex library support Unicode or, more generally, non-ASCII characters? I searched the web and it seems that it does not.

My project is resource-critical, so I don't want to use large libraries such as ICU or Boost.Regex for it.

Any help would be appreciated.

Recommended answer

It looks like POSIX regex works properly in a UTF-8 locale. I wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]", for example, and everything worked fine.

Note: the main thing to remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.

#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
  int ret;
  regex_t reg;
  regmatch_t matches[10];

  if (argc != 3) {
    fprintf(stderr, "Usage: %s regex string\n", argv[0]);
    return 1;
  }

  setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

  if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
    char buf[256];
    regerror(ret, &reg, buf, sizeof(buf));
    fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
    return 1;
  }

  ret = regexec(&reg, argv[2], 10, matches, 0);
  if (ret == 0) {
    size_t i;
    /* Print every matched (sub)expression; offsets are bytes, not characters */
    for (i = 0; i < sizeof(matches) / sizeof(matches[0]); i++) {
      int start, len;
      if (matches[i].rm_so == -1) break; /* no more submatches */
      start = (int)matches[i].rm_so;
      len = (int)(matches[i].rm_eo - matches[i].rm_so);
      /* %.*s prints exactly len bytes, so no temporary buffer is needed */
      printf("%d: %d-%d: '%.*s'\n", (int)i, start, (int)matches[i].rm_eo,
             len, argv[2] + start);
    }
  } else if (ret == REG_NOMATCH) {
    fprintf(stderr, "no match\n");
  }

  regfree(&reg); /* Release memory allocated by regcomp() */
  return 0;
}

Usage example:

$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$

The match is two bytes long (offsets 5-7) because each Cyrillic letter occupies two bytes in UTF-8.
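
Because rm_so and rm_eo are byte offsets, you may want to convert them to character counts. The following is only a minimal sketch, assuming the string is valid UTF-8; the helper name utf8_char_count is made up for illustration and is not part of regex.h. It counts every byte that is not a UTF-8 continuation byte (10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count UTF-8 characters (code points) in the byte
 * range [start, end) of s. Assumes s holds valid UTF-8, where every
 * continuation byte has the bit pattern 10xxxxxx. */
size_t utf8_char_count(const char *s, size_t start, size_t end) {
  size_t i, count = 0;
  assert(s != NULL && start <= end);
  for (i = start; i < end; i++) {
    /* A byte that is not a continuation byte starts a new character */
    if (((unsigned char)s[i] & 0xC0) != 0x80) count++;
  }
  return count;
}
```

With the match above, utf8_char_count(argv[2], matches[0].rm_so, matches[0].rm_eo) would give the length of the match in characters, and utf8_char_count(argv[2], 0, matches[0].rm_so) the character position where the match starts.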
