Does POSIX regex.h support Unicode or, more generally, non-ASCII characters?

Question

Hi, I am using the standard regex library (regcomp, regexec, ...). I now need to add Unicode support to my regular-expression code.

Does the standard regex library support Unicode or, more generally, non-ASCII characters? I researched this on the web and I think it does not.

My project is resource-constrained, so I don't want to pull in large libraries such as ICU or Boost.Regex.

Any help would be appreciated.

Answer

It looks like POSIX regex works properly under a UTF-8 locale. I wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]" (for example), and everything worked fine.

Note: the main thing to remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.

#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
  int ret;
  regex_t reg;
  regmatch_t matches[10];

  if (argc != 3) {
    fprintf(stderr, "Usage: %s regex string\n", argv[0]);
    return 1;
  }

  setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

  if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
    char buf[256];
    regerror(ret, &reg, buf, sizeof(buf));
    fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
    return 1;
  }

  if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
    int i;
    int size;
    /* Print each matched group; offsets are byte offsets into argv[2]. */
    for (i = 0; i < (int)(sizeof(matches) / sizeof(matches[0])); i++) {
      if (matches[i].rm_so == -1) break;
      size = (int)(matches[i].rm_eo - matches[i].rm_so);
      /* %.*s prints exactly `size` bytes, so no temporary buffer is needed;
         regoff_t is cast to long because its width is platform-dependent. */
      printf("%d: %ld-%ld: '%.*s'\n", i, (long)matches[i].rm_so,
             (long)matches[i].rm_eo, size, argv[2] + matches[i].rm_so);
    }
  }

  regfree(&reg); /* Release the memory allocated by regcomp() */
  return 0;
}
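
The program above ignores the return value of setlocale(), so if the environment names a locale that is not installed, it silently stays in the "C" locale and [[:alpha:]] will not match non-ASCII letters. The following is a minimal sketch of a startup check, not part of the original answer. The helper names is_utf8_codeset and use_env_locale_utf8 are hypothetical; setlocale(), nl_langinfo(CODESET) and strcasecmp() are standard POSIX calls. It assumes that the codeset is reported as "UTF-8" or "UTF8", which may not cover every platform's spelling.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper: returns 1 if a codeset name denotes UTF-8.
   Assumption: platforms spell it "UTF-8" or "UTF8" (any letter case). */
int is_utf8_codeset(const char *name) {
  return name != NULL &&
         (strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0);
}

/* Hypothetical helper: adopt the locale named by the environment and
   report whether its character encoding is UTF-8. Returns 1 for UTF-8
   and 0 otherwise, including the case where setlocale() fails and the
   default "C" locale stays active. */
int use_env_locale_utf8(void) {
  if (setlocale(LC_ALL, "") == NULL) return 0;
  return is_utf8_codeset(nl_langinfo(CODESET));
}
```

A caller could invoke use_env_locale_utf8() in place of the bare setlocale() call and print a warning to stderr when it returns 0, before calling regcomp().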

Usage example for the test program:

$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$

The match is two bytes long because each Cyrillic letter occupies two bytes in UTF-8.
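
Since rm_so and rm_eo are byte offsets, code that needs character positions has to convert them. The sketch below is one possible way to do that, assuming the subject string is valid UTF-8. The helper name utf8_chars_before is hypothetical and not part of regex.h. It counts every byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx).

```c
#include <stddef.h>

/* Hypothetical helper: count the UTF-8 characters (code points) encoded
   in the first `nbytes` bytes of `s`. Assumes `s` holds valid UTF-8 and
   that `nbytes` falls on a character boundary, as regexec() offsets do
   in a UTF-8 locale. */
size_t utf8_chars_before(const char *s, size_t nbytes) {
  size_t i, count = 0;
  for (i = 0; i < nbytes; i++) {
    /* Continuation bytes look like 10xxxxxx; skip them. */
    if (((unsigned char)s[i] & 0xC0) != 0x80) count++;
  }
  return count;
}
```

Applied to the sample string ' 359 фыва', the byte range 5-7 reported above corresponds to the character range 5-6: utf8_chars_before(s, 5) is 5 and utf8_chars_before(s, 7) is 6.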
