Flex(lexer)对Unicode的支持 [英] Flex(lexer) support for unicode

查看：124 发布时间：2020/11/8 20:51:37 unicode flex-lexer

本文介绍了Flex(lexer)对Unicode的支持的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想知道最新版本的flex是否支持unicode?

如果是这样，如何使用模式来匹配汉字?

解决方案

目前，flex仅生成8位扫描程序，这基本上限制了您使用UTF-8.因此，如果您有一种模式:

肖晗   { printf ("xiaohan\n"); }

它将按预期工作，因为模式和输入中的字节顺序将相同.更加困难的是角色类.如果要匹配字符肖或晗，则不能写:

[肖晗]   { printf ("xiaohan/2\n"); }

因为这将匹配六个字节中的每个字节0xe8、0x82、0x96、0xe6、0x99和0x97，实际上这意味着如果提供肖晗作为输入，则模式将匹配六次.因此，在这种简单情况下，您必须将模式重写为(肖|晗).

对于范围，汉斯·阿伯格(Hans Aberg)已在其中编写了工具Haskell 将其转换为8位模式:

Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0

这不是很漂亮，但是应该可以.

I am wondering if the newest version of flex supports unicode?

If so, how can use patterns to match Chinese characters?

More: Use regular expression to match ANY Chinese character in utf-8 encoding

解决方案

At the moment, flex only generates 8-bit scanners which basically limits you to use UTF-8. So if you have a pattern:

肖晗   { printf ("xiaohan\n"); }

it will work as expected, as the sequence of bytes in the pattern and in the input will be the same. What's more difficult is character classes. If you want to match either the character 肖 or 晗, you can't write:

[肖晗]   { printf ("xiaohan/2\n"); }

because this will match each of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97, which in practice means that if you supply 肖晗 as the input, the pattern will match six times. So in this simple case, you have to rewrite the pattern to (肖|晗).

For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:

Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0

This isn't pretty, but it should work.

这篇关于Flex(lexer)对Unicode的支持的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Flex(lexer)对Unicode的支持 [英] Flex(lexer) support for unicode

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

Flex(lexer)对Unicode的支持 [英] Flex(lexer) support for unicode

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭