How to make flex (lexical scanner) read UTF-8 character input?

Views: 275 Published: 2020/7/13 3:21:47 Tags: utf-8 lexical-analysis flex-lexer


Question

It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII character, it stops scanning as if it had reached EOF.

Is there a way to force flex to eat my UTF-8 characters? I don't need it to actually match UTF-8 characters, just to consume them when I use the '.' pattern.

Any suggestions?

Edit

The simplest solution was to define:

ANY [\x00-\xff]

and use 'ANY' (written {ANY} inside a rule) instead of '.' in my rules.
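Why a plain byte range is enough can be illustrated in C. In UTF-8, every non-ASCII character is encoded as two or more bytes, all in the range 0x80-0xFF, so a class covering 0x00-0xFF consumes such a character one byte at a time. Note that, unlike '.', this class also matches the newline byte. The sketch below is only an illustration; `count_non_ascii_bytes` is a hypothetical helper name, not part of flex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes in s that lie outside 7-bit ASCII.
 * Each byte is read as unsigned char so that values 0x80-0xFF are not
 * mistaken for negative numbers on platforms where char is signed. */
size_t count_non_ascii_bytes(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        if (*p > 0x7F)
            ++n;
    }
    return n;
}
```

For example, the single character 'é' (bytes 0xC3 0xA9) contributes two such bytes, so a byte-wise rule fires twice for it.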

Recommended answer

I have been looking into this myself and reading the flex mailing list to see whether anyone had thought about it. Getting flex to read Unicode is a complex affair...

UTF-8 can be handled, but most other encodings (the 16-bit ones) would lead to massive tables driving the automaton.

A common approach so far is:

What I did was simply write patterns that match single UTF-8 characters. They look something like the following, but you might want to re-read the UTF-8 specification, because I wrote this a long time ago.
You will of course need to combine these, since you want Unicode strings, not just single characters.

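/* UB matches one UTF-8 continuation byte (0x80-0xBF). */
UB [\200-\277]
%%
[\300-\337]{UB}                   { /* 2-byte sequence: do something */ }
[\340-\357]{UB}{2}                { /* 3-byte sequence: do something */ }
[\360-\367]{UB}{3}                { /* 4-byte sequence: do something */ }
[\370-\373]{UB}{4}                { /* 5-byte form, no longer valid UTF-8 since RFC 3629 */ }
[\374-\375]{UB}{5}                { /* 6-byte form, no longer valid UTF-8 since RFC 3629 */ }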
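The same lead-byte ranges can be expressed as a small C function, which may make the structure of the patterns above easier to see. This is a minimal sketch under stated assumptions: `utf8_sequence_length` and `is_continuation` are hypothetical names, the function accepts at most 4-byte sequences (the limit in the current UTF-8 definition, RFC 3629), and, like the patterns, it checks only the byte structure. It does not reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if b is a UTF-8 continuation byte (0x80-0xBF, octal \200-\277). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: returns the length in bytes of the UTF-8 sequence
 * starting at s[0], or 0 if the sequence is malformed or truncated.
 * avail is the number of bytes readable from s. */
size_t utf8_sequence_length(const unsigned char *s, size_t avail)
{
    assert(s != NULL);
    if (avail == 0)
        return 0;

    unsigned char lead = s[0];
    size_t len;
    if (lead <= 0x7F)
        len = 1;                              /* ASCII */
    else if (lead >= 0xC0 && lead <= 0xDF)
        len = 2;                              /* \300-\337 */
    else if (lead >= 0xE0 && lead <= 0xEF)
        len = 3;                              /* \340-\357 */
    else if (lead >= 0xF0 && lead <= 0xF7)
        len = 4;                              /* \360-\367 */
    else
        return 0;  /* stray continuation byte or obsolete 5/6-byte lead */

    if (len > avail)
        return 0;                             /* truncated sequence */
    for (size_t i = 1; i < len; ++i) {
        if (!is_continuation(s[i]))
            return 0;
    }
    return len;
}
```

A stricter validator would also narrow the lead-byte ranges (for instance, 0xC0 and 0xC1 can only start overlong encodings), which is the kind of detail the UTF-8 specification should be consulted for.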

Taken from the mailing list.

After looking into it further, I may try to create a proper patch for UTF-8 support. The solution above seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to create a substitute for '.' that matches all ASCII and UTF-8 characters, but that is still rather ugly.
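What such a '.' substitute does, consuming one whole character per match rather than one byte, can be sketched in C. The helper names `utf8_skip_char` and `utf8_char_count` are hypothetical, and the sketch assumes the input is well-formed, NUL-terminated UTF-8: it takes one byte and then swallows any continuation bytes (0x80-0xBF) that follow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: advances past one character the way a '.' substitute
 * would consume it. Takes the first byte, then skips every continuation
 * byte (0x80-0xBF) that follows. Assumes well-formed UTF-8. */
const char *utf8_skip_char(const char *p)
{
    assert(p != NULL && *p != '\0');
    ++p;
    while (((unsigned char)*p & 0xC0) == 0x80)
        ++p;
    return p;
}

/* Counts characters (not bytes) in a NUL-terminated UTF-8 string by
 * repeatedly skipping one character at a time. */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s != '\0') {
        s = utf8_skip_char(s);
        ++n;
    }
    return n;
}
```

With this, a string such as "h\xC3\xA9llo" is seen as five characters even though it occupies six bytes, which is the behaviour a character-level '.' replacement is meant to give.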

Hope this helps!
