Flex (lexer) support for Unicode
Does the newest version of flex support Unicode?
If so, how can I use patterns to match Chinese characters?
More: Use regular expression to match ANY Chinese character in utf-8 encoding
At the moment, flex only generates 8-bit scanners, which basically limits you to UTF-8 input matched byte by byte. So if you have a pattern:
肖晗 { printf ("xiaohan\n"); }
it will work as expected, because the sequence of bytes in the pattern and in the input is the same. Character classes are more difficult. If you want to match either the character 肖 or 晗, you can't write:
[肖晗] { printf ("xiaohan/2\n"); }
because this will match each of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97, which in practice means that if you supply 肖晗 as the input, the pattern will match six times. So in this simple case, you have to rewrite the pattern to (肖|晗).
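To make the difference concrete, here is a small C sketch that imitates the two behaviours outside of flex. It is not flex code and not how flex is implemented; the function names count_class_matches and count_alternation_matches are made up for illustration. The first treats the class as a plain set of bytes, the second looks for the whole three-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 bytes of 肖 (U+8096) and 晗 (U+6657), written as escapes so the
   example does not depend on the source file's encoding. */
static const char XIAO[] = "\xE8\x82\x96";
static const char HAN[]  = "\xE6\x99\x97";

/* Imitates [肖晗] in an 8-bit scanner: the class is just a set of bytes,
   so every input byte found in that set counts as one match.
   (Hypothetical helper, for illustration only.) */
size_t count_class_matches(const char *class_bytes, const char *input) {
    assert(class_bytes != NULL && input != NULL);
    size_t matches = 0;
    for (const char *p = input; *p != '\0'; ++p) {
        if (strchr(class_bytes, *p) != NULL) {
            ++matches;
        }
    }
    return matches;
}

/* Imitates (肖|晗): each alternative is a whole three-byte sequence.
   (Hypothetical helper, for illustration only.) */
size_t count_alternation_matches(const char *input) {
    assert(input != NULL);
    size_t matches = 0;
    const char *p = input;
    while (*p != '\0') {
        if (strncmp(p, XIAO, 3) == 0 || strncmp(p, HAN, 3) == 0) {
            p += 3;          /* consume one complete character */
            ++matches;
        } else {
            ++p;             /* skip a byte that starts neither sequence */
        }
    }
    return matches;
}
```

Fed the six bytes of 肖晗, the first function counts six matches and the second counts two, which mirrors the behaviour described above.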
For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
This isn't pretty, but it should work.
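For the original question about Chinese characters, the same idea can be worked out by hand for a single range. Assuming "Chinese character" means only the basic CJK Unified Ideographs block, U+4E00 to U+9FFF (an assumption; other CJK blocks are not covered), the endpoints encode as E4 B8 80 and E9 BF BF, which gives a byte pattern along the lines of \xE4[\xB8-\xBF][\x80-\xBF]|[\xE5-\xE9][\x80-\xBF][\x80-\xBF]. The C sketch below is only a way to check that pattern against a few code points; utf8_encode and is_basic_cjk_utf8 are hypothetical helper names, not part of flex or of the Haskell tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out and returns the byte count,
   or 0 if cp is above U+10FFFF. Surrogates are not rejected; this is a
   sketch for checking patterns, not a validating encoder. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Byte-level equivalent of
   \xE4[\xB8-\xBF][\x80-\xBF]|[\xE5-\xE9][\x80-\xBF][\x80-\xBF]
   applied to the first three bytes of s. Returns 1 on a match. */
int is_basic_cjk_utf8(const unsigned char *s, size_t len) {
    if (s == NULL || len < 3) {
        return 0;
    }
    int cont1 = s[1] >= 0x80 && s[1] <= 0xBF;
    int cont2 = s[2] >= 0x80 && s[2] <= 0xBF;
    if (s[0] == 0xE4) {
        return s[1] >= 0xB8 && s[1] <= 0xBF && cont2;
    }
    if (s[0] >= 0xE5 && s[0] <= 0xE9) {
        return cont1 && cont2;
    }
    return 0;
}
```

Encoding U+8096 (肖) with utf8_encode yields the bytes E8 82 96 quoted earlier, and is_basic_cjk_utf8 accepts them, while the neighbours just outside the block (U+4DFF and U+A000) are rejected.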