How do I lex Unicode characters in C?

Question

I've written a lexer in C. It currently lexes ASCII files successfully, but I'm confused about how I would lex Unicode. Which Unicode encodings would I need to handle: for instance, should I support UTF-8, UTF-16, etc.? What do languages like Rust or Go support?

Are there any libraries that can help me out? I would prefer to try doing it myself so I can learn, but even then, a small library that I could read and learn from would be great.

Solution

There are already versions of lex (and other lexer tools) that support Unicode; they are tabulated on the Wikipedia page List of Lexer Generators. There is also a list of lexer tools on the Wikipedia Parser page. In summary, the following tools handle Unicode:

  • JavaCC - JavaCC generates lexical analyzers written in Java.
  • JFlex - A lexical analyzer generator for Java.
  • Quex - A fast universal lexical analyzer generator for C and C++.
  • FsLex - A lexer generator for byte and Unicode character input for F#.

And, of course, there are the techniques used by W3.org and cited by @jim mcnamara at http://www.w3.org/2005/03/23-lex-U.

You say you have written your own lexer in C, but you have used the tag lex for the tool called lex; perhaps that was an oversight?

In the comments you say you have not used regular expressions but also want to learn. Learning something about the theory of language recognition is key to writing an efficient, working lexer. The symbols being recognised belong to a Chomsky Type 3 language, or regular language, which can be described by regular expressions. Regular expressions can be implemented by code that implements a finite state automaton (or finite state machine). The standard implementation of a finite state machine is a loop containing a switch. Most experienced coders should know this form and be able to recognise and exploit it:

while ( not <<EOF>> ) {
  switch ( input_symbol ) {
    case ( state_symbol[0] ) :
        ...
    case ( state_symbol[1] ) :
        ...
    default:
        ...
  }
}
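
To make the pattern concrete, here is a minimal runnable sketch of a loop containing a switch. It is only an illustration, not the asker's lexer: the states, the function name count_tokens, and the token classes (identifiers and integer literals) are all hypothetical, and this variant switches on the current state and tests the input symbol inside each case, which is one common way to write the same machine.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for a tiny lexer; the names are illustrative only. */
enum state { START, IN_IDENT, IN_NUMBER };

/* Count identifiers and integer literals in a NUL-terminated ASCII string. */
static void count_tokens(const char *src, int *idents, int *numbers)
{
    assert(src != NULL && idents != NULL && numbers != NULL);

    enum state st = START;
    *idents = 0;
    *numbers = 0;

    for (;;) {
        unsigned char c = (unsigned char)*src;

        switch (st) {
        case START:
            if (isalpha(c) || c == '_') { st = IN_IDENT;  ++*idents;  }
            else if (isdigit(c))        { st = IN_NUMBER; ++*numbers; }
            break;
        case IN_IDENT:
            /* Token ended: go back to START and re-examine this symbol. */
            if (!(isalnum(c) || c == '_')) { st = START; continue; }
            break;
        case IN_NUMBER:
            if (!isdigit(c)) { st = START; continue; }
            break;
        }

        if (c == '\0')
            break;          /* end of input */
        ++src;              /* consume the symbol */
    }
}
```

Called on the string "foo 42 bar_1 7x", this sketch counts three identifiers (foo, bar_1, x) and two numbers (42, 7). Only the line that fetches the next symbol and the character tests depend on the width of a symbol.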

If you had coded in this style, the same code would work whether the symbols being handled were 8-bit or 16-bit, as the algorithmic pattern remains the same.
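
For UTF-8 input, which is the encoding both Go and Rust define for their source files, one symbol can span several bytes, so a hand-written lexer usually decodes bytes into code points before the switch sees them. The following is a sketch of such a decoder under stated assumptions: utf8_decode is a hypothetical helper name, not part of any library, and returning 0 for every kind of malformed input is a simplification chosen for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s[0..len).
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is empty, truncated, or malformed. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t n;       /* total length of the sequence in bytes */
    uint32_t v;     /* accumulated code point value */
    uint32_t min;   /* smallest value that legitimately needs n bytes */

    if (b0 < 0x80)                { *cp = b0; return 1; }   /* ASCII */
    else if ((b0 & 0xE0) == 0xC0) { n = 2; v = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; v = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; v = b0 & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (len < n)
        return 0;   /* truncated sequence */

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3Fu);
    }

    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;

    *cp = v;
    return n;
}
```

A lexer built this way would call the decoder at the top of its loop, advance by the returned byte count, and switch on the resulting uint32_t code point instead of on a char. ASCII input still takes the one-byte path, so existing ASCII rules keep working unchanged.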

Ad hoc coding of a lexical analyser, without an understanding of the underlying theory and practice, will eventually hit its limits. I think you will find it beneficial to read a little more into this area.
