How to make a flex (lexical scanner) read UTF-8 character input?

Question

It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII character, it stops scanning as if it had reached EOF.

Is there a way to force flex to eat my UTF-8 characters? I don't need it to actually match them as UTF-8 characters, just to consume them when I use the '.' pattern.

Any suggestions?

Edit

The simplest solution is to define:

ANY [\x00-\xff]

and use {ANY} instead of '.' in my rules. Note that, unlike '.', this class also matches the newline character.
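As an illustration of the edit above, here is a minimal sketch of a complete scanner file built around such a definition. The file layout, the identifier rule, and the printf action are assumptions added for the example and are not part of the original question. The %option 8bit line is included because flex falls back to a 7-bit scanner when the -Cf or -CF table options are used, which may be one cause of the behaviour described in the question.

```lex
%option noyywrap 8bit
%{
#include <stdio.h>
%}

ANY     [\x00-\xff]

%%
[a-zA-Z_][a-zA-Z0-9_]*   { printf("IDENT %s\n", yytext); }
[ \t\r\n]+               { /* skip whitespace */ }
{ANY}                    { /* swallow any other single byte, including each byte of a UTF-8 sequence */ }
%%

int main(void)
{
    /* Reads from stdin until end of input. */
    return yylex();
}
```

Assuming the file is saved as scanner.l (a hypothetical name), it can be built with "flex scanner.l" followed by "cc lex.yy.c -o scanner". Because {ANY} matches exactly one byte, a multi-byte UTF-8 character is consumed as several separate matches, which is fine when the goal is only to skip such input.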

Recommended answer

I have been looking into this myself and reading the Flex mailing list to see whether anyone had thought about it. Getting Flex to read Unicode is a complex affair ...

UTF-8 input can be handled, but most other encodings (the 16-bit ones, such as UTF-16) would lead to massive tables driving the automata.

A common method so far is:

What I did was simply write patterns that match single UTF-8 characters. They look something like the following, but you might want to re-read the UTF-8 specification, because I wrote this a long time ago.
You will of course need to combine these, since you want Unicode strings, not just single characters.

UB      [\200-\277]
%%
[\300-\337]{UB}                   { /* do something */ }
[\340-\357]{UB}{2}                { /* do something */ }
[\360-\367]{UB}{3}                { /* do something */ }
[\370-\373]{UB}{4}                { /* do something */ }
[\374-\375]{UB}{5}                { /* do something */ }

Taken from the mailing list.

After looking into it further, I may try to create a proper patch for UTF-8 support. The above solution seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to create a substitute for '.' that matches all ASCII and UTF-8 characters, but that is still rather ugly.
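The '.' substitute mentioned above could be sketched roughly as follows. The definition names (ASCII, UB, U2, U3, U4, UANY) and the actions are hypothetical and chosen only for this example. The sketch assumes the current UTF-8 definition in RFC 3629, which allows at most four bytes per character and lead bytes from \xc2 to \xf4, so it deliberately leaves out the older 5- and 6-byte forms that appear in the quoted patterns. It is still a loose approximation: it accepts some sequences that a strict validator would reject, such as overlong 3-byte forms and encoded surrogates.

```lex
%option noyywrap 8bit
%{
#include <stdio.h>
%}

ASCII   [\x00-\x7f]
UB      [\x80-\xbf]
U2      [\xc2-\xdf]{UB}
U3      [\xe0-\xef]{UB}{UB}
U4      [\xf0-\xf4]{UB}{UB}{UB}
UANY    {ASCII}|{U2}|{U3}|{U4}

%%
{UANY}        { printf("character of %d byte(s)\n", (int)yyleng); }
[\x00-\xff]   { fprintf(stderr, "stray byte 0x%02x\n", (unsigned char)yytext[0]); }
%%

int main(void)
{
    /* Reads from stdin until end of input. */
    return yylex();
}
```

Because flex prefers the longest match, a well-formed multi-byte sequence is taken by {UANY} as one unit, while a byte that cannot start or continue a sequence falls through to the single-byte rule. Like [\x00-\xff], {UANY} also matches the newline character, so a closer equivalent of '.' would need to exclude \n from ASCII.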

Hope this helps!
