What constitutes one character for regcomp? Which multibyte encoding determines this?

Views: 141 | Posted: 2020/5/29 18:49:08 | Tags: regex posix locale glibc multibyte


Question

regcomp (as implemented in glibc) is the POSIX function for compiling regular expressions:

     #include <regex.h>

     int regcomp(regex_t *restrict preg, const char *restrict pattern,
                 int cflags);

Some constructs in regular expressions depend on the notion of a single character, for example the bracket expression [abc].

If a multibyte encoding is in use and the expression contains a multibyte letter, the result differs depending on whether that letter is treated as a sequence of bytes or as one multibyte character.

Here I illustrate the idea with grep (which need not behave the same as the C function regcomp in this respect):

$ { echo Г; echo Д; } | egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=C egrep '[Д]'
Г
Д
$
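
The second result is what a byte-by-byte reading of the bracket expression would produce: in UTF-8, Д (U+0414) is the two bytes D0 94 and Г (U+0413) is D0 93, so a byte-wise [Д] is the set {0xD0, 0x94} and the leading 0xD0 of Г is a member. A minimal C sketch of that check follows; the helper name `shares_byte` is hypothetical and only models a byte-oriented bracket match, it is not how grep or regcomp is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if any byte of `text` also occurs in
   `set`. This models how a byte-oriented matcher would evaluate the
   bracket expression [set] somewhere in `text`. */
int shares_byte(const char *set, const char *text)
{
    assert(set != NULL && text != NULL);
    /* strpbrk finds the first byte of `text` that is also in `set`. */
    return strpbrk(text, set) != NULL;
}
```

With the UTF-8 bytes written out explicitly, `shares_byte("\xD0\x94", "\xD0\x93")` returns 1, because both strings contain the byte 0xD0.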

LANG supplies the default for any specific locale variable (LC_*) that is not set, so the question is: which of these variables affects regcomp's idea of the encoding?

$ locale
LANG=ru_RU.utf8
LC_CTYPE="ru_RU.utf8"
LC_NUMERIC="ru_RU.utf8"
LC_TIME="ru_RU.utf8"
LC_COLLATE="ru_RU.utf8"
LC_MONETARY="ru_RU.utf8"
LC_MESSAGES=POSIX
LC_PAPER="ru_RU.utf8"
LC_NAME="ru_RU.utf8"
LC_ADDRESS="ru_RU.utf8"
LC_TELEPHONE="ru_RU.utf8"
LC_MEASUREMENT="ru_RU.utf8"
LC_IDENTIFICATION="ru_RU.utf8"
LC_ALL=
$
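
For reference, POSIX resolves each locale category from the environment in the order LC_ALL, then the category's own variable, then LANG, with empty values treated as unset. The sketch below applies that order to LC_CTYPE; the function name `ctype_locale_from_env` is hypothetical, and the final fallback to "C" is an assumption (POSIX leaves that default implementation-defined).

```c
#include <stdlib.h>

/* Hypothetical helper: returns the locale name that the environment
   selects for LC_CTYPE, using the POSIX precedence
   LC_ALL > LC_CTYPE > LANG. Empty values count as unset. */
const char *ctype_locale_from_env(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };

    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && *v != '\0')
            return v;
    }
    /* Assumption: fall back to "C" when nothing is set. */
    return "C";
}
```

In the environment listed above, LC_ALL is empty and LC_CTYPE is not set explicitly (locale prints derived values in double quotes), so this lookup would fall through to LANG.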

Recommended answer

As for grep (which need not behave the same as regcomp), it appears to honor LC_CTYPE for this decision:

$ { echo Г; echo Д; } | LANG=en_US.utf8 egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_COLLATE=C egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_CTYPE=C egrep '[Д]'
Г
Д
$
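
The same experiment can be run against regcomp itself. The sketch below is one way to do it, not a confirmed description of glibc internals: it sets only LC_CTYPE with setlocale and then compiles and runs the pattern. The helper name `matches_with_ctype` and its return convention are hypothetical. Note that a C program starts in the "C" locale, so variables such as LANG or LC_CTYPE influence regcomp only after the program calls setlocale.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` and search for it in `text`
   while LC_CTYPE is set to `ctype`.
   Returns 1 on a match, 0 otherwise, and -1 if the locale is not
   available or the pattern does not compile. */
int matches_with_ctype(const char *ctype, const char *pattern,
                       const char *text)
{
    assert(ctype != NULL && pattern != NULL && text != NULL);

    /* Change only the character-classification category. */
    if (setlocale(LC_CTYPE, ctype) == NULL)
        return -1;

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* REG_NOMATCH and runtime errors are both reported as 0 here. */
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling it as `matches_with_ctype("C", "[\xD0\x94]", "\xD0\x93")` and again with a UTF-8 locale name such as "en_US.utf8" (assuming that locale is installed) reproduces the grep comparison above with regcomp in place of grep. Passing "" as the locale name makes setlocale take LC_CTYPE from the environment.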
