What constitutes one character for regcomp? Which multibyte encoding determines this?
Question
regcomp (from glibc) is a POSIX function for compiling regular expressions.
int regcomp(regex_t *restrict preg, const char *restrict pattern,
int cflags);
Some constructions in regular expressions depend on the notion of a single character, for example [abc].
If a multibyte encoding is in use and the expression contains a multibyte letter, the result differs depending on whether the pattern is treated as a sequence of bytes or as a sequence of multibyte letters.
Here I illustrate the idea with grep (which may not behave the same as the C function regcomp in this respect):
$ { echo Г; echo Д; } | egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=C egrep '[Д]'
Г
Д
$
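Why the LANG=C run prints both lines becomes clearer when the letters are viewed as bytes. The sketch below assumes the input is UTF-8 encoded; hex_bytes is a hypothetical helper written for this illustration, not part of any library.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write the bytes of `s` into `out` as
 * space-separated uppercase hex, e.g. "D0 94". Output is cut off
 * at a whole byte if `out` is too small. */
void hex_bytes(const char *s, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return;
    out[0] = '\0';

    for (const unsigned char *p = (const unsigned char *)s; *p; ++p) {
        int n = snprintf(out + used, outsize - used,
                         used ? " %02X" : "%02X", *p);
        if (n < 0 || (size_t)n >= outsize - used) {
            out[used] = '\0';   /* drop the partially written byte */
            break;
        }
        used += (size_t)n;
    }
}
```

In UTF-8, Г (U+0413) is the byte pair D0 93 and Д (U+0414) is D0 94. If [Д] is read byte by byte, it becomes a set containing the bytes D0 and 94, and the line with Г matches because it contains D0.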
LANG supplies the default value for any specific locale variable that is not set, so the question is: which of these variables affects regcomp's idea of the encoding?
$ locale
LANG=ru_RU.utf8
LC_CTYPE="ru_RU.utf8"
LC_NUMERIC="ru_RU.utf8"
LC_TIME="ru_RU.utf8"
LC_COLLATE="ru_RU.utf8"
LC_MONETARY="ru_RU.utf8"
LC_MESSAGES=POSIX
LC_PAPER="ru_RU.utf8"
LC_NAME="ru_RU.utf8"
LC_ADDRESS="ru_RU.utf8"
LC_TELEPHONE="ru_RU.utf8"
LC_MEASUREMENT="ru_RU.utf8"
LC_IDENTIFICATION="ru_RU.utf8"
LC_ALL=
$
Recommended answer
As for grep (which may not behave the same as regcomp), it seems to honor LC_CTYPE for this decision:
$ { echo Г; echo Д; } | LANG=en_US.utf8 egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_COLLATE=C egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_CTYPE=C egrep '[Д]'
Г
Д
$
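The grep experiment can be repeated against regcomp itself. What follows is only a sketch under stated assumptions: match_under_ctype is a hypothetical helper, the locale names passed to it must be installed on the system, and it changes nothing but LC_CTYPE through setlocale.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of any library): compile `pattern`
 * and test it against `text` with LC_CTYPE set to `ctype_locale`.
 * Returns 1 on match, 0 on no match, -1 if the locale is
 * unavailable or the pattern does not compile. */
int match_under_ctype(const char *ctype_locale, const char *pattern,
                      const char *text)
{
    regex_t re;
    int rc;

    /* Only LC_CTYPE is changed; the new setting stays in effect
     * for the process after this function returns. */
    if (setlocale(LC_CTYPE, ctype_locale) == NULL)
        return -1;

    /* Assumption being tested: the LC_CTYPE in effect here decides
     * whether the pattern bytes are grouped into multibyte
     * characters or taken one byte at a time. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With glibc, match_under_ctype("C", "[Д]", "Г") would be expected to return 1, mirroring the LC_CTYPE=C run of grep, while the same call with a UTF-8 locale such as en_US.UTF-8 (if installed) should return 0. Varying LC_COLLATE instead would require a similar helper that passes LC_COLLATE to setlocale.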