穷人的＆QUOT;词法＆QUOT;对于C＃ [英] Poor man's "lexer" for C#

查看：104 发布时间：2016/8/26 19:49:15 c# regex lexer

本文介绍了穷人的＆QUOT;词法＆QUOT;对于C＃的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想C＃编写一个非常简单的解析器。

I'm trying to write a very simple parser in C#.

我需要一个词法分析器 - 这让与令牌我交往定期EX pressions，所以它在regexs读取并给我回符号

I need a lexer -- something that lets me associate regular expressions with tokens, so it reads in regexs and gives me back symbols.

好像我应该能够使用正则表达式来进行实际的繁重，但我不能看到一个简单的方法来做到这一点。一方面，正则表达式似乎只对字符串工作，而不是流（这是为什么！？！？）。

It seems like I ought to be able to use Regex to do the actual heavy lifting, but I can't see an easy way to do it. For one thing, Regex only seems to work on strings, not streams (why is that!?!?).

基本上，我想下面的接口实现：

Basically, I want an implementation of the following interface:

interface ILexer : IDisposable
{
    /// <summary>
    /// Return true if there are more tokens to read
    /// </summary>
    bool HasMoreTokens { get; }
    /// <summary>
    /// The actual contents that matched the token
    /// </summary>
    string TokenContents { get; }
    /// <summary>
    /// The particular token in "tokenDefinitions" that was matched (e.g. "STRING", "NUMBER", "OPEN PARENS", "CLOSE PARENS"
    /// </summary>
    object Token { get; }
    /// <summary>
    /// Move to the next token
    /// </summary>
    void Next();
}

interface ILexerFactory
{
    /// <summary>
    /// Create a Lexer for converting a stream of characters into tokens
    /// </summary>
    /// <param name="reader">TextReader that supplies the underlying stream</param>
    /// <param name="tokenDefinitions">A dictionary from regular expressions to their "token identifers"</param>
    /// <returns>The lexer</returns>
    ILexer CreateLexer(TextReader reader, IDictionary<string, object> tokenDefinitions);
}

所以，pluz发送codz ...

不，说真的，我要开始写上述接口的实现但我觉得很难相信，有没有.NET（2.0），这样做已经是一些简单的方法。

So, pluz send the codz...
No, seriously, I am about to start writing an implementation of the above interface yet I find it hard to believe that there isn't some simple way of doing this in .NET (2.0) already.

所以，一个简单的方法有什么建议上面做？（另外，我不希望任何code发电机的性能是不是这件事情很重要，我不想引入任何复杂到构建过程。）

So, any suggestions for a simple way to do the above? (Also, I don't want any "code generators". Performance is not important for this thing and I don't want to introduce any complexity into the build process.)

推荐答案

最初的版本我张贴在这里作为一个答案了，因为它只有在有匹配当前前$ P不止一个正则表达式的工作问题$ pssion。也就是说，一旦只有一个正则表达式匹配，它会返回一个道理 - 而大多数人想要的正则表达式是贪婪。这是特别的东西的情况下，如援引字符串。

The original version I posted here as an answer had a problem in that it only worked while there was more than one "Regex" that matched the current expression. That is, as soon as only one Regex matched, it would return a token - whereas most people want the Regex to be "greedy". This was especially the case for things such as "quoted strings".

这对正则表达式的顶部坐在唯一的解决办法是读取输入行由行（这意味着你不能有一个跨越多行令牌）。我可以用这个活 - 这毕竟是一个穷人的词法分析器！此外，它通常是非常有用得到行号信息从词法分析器在任何情况下。

The only solution that sits on top of Regex is to read the input line-by-line (which means you cannot have tokens that span multiple lines). I can live with this - it is, after all, a poor man's lexer! Besides, it's usually useful to get line number information out of the Lexer in any case.

所以，下面是解决这些问题的新版本。信贷此亦这个

So, here's a new version that addresses these issues. Credit also goes to this

public interface IMatcher
{
    /// <summary>
    /// Return the number of characters that this "regex" or equivalent
    /// matches.
    /// </summary>
    /// <param name="text">The text to be matched</param>
    /// <returns>The number of characters that matched</returns>
    int Match(string text);
}

sealed class RegexMatcher : IMatcher
{
    private readonly Regex regex;
    public RegexMatcher(string regex)
    {
        this.regex = new Regex(string.Format("^{0}", regex));
    }

    public int Match(string text)
    {
        var m = regex.Match(text);
        return m.Success ? m.Length : 0;
    }

    public override string ToString()
    {
        return regex.ToString();
    }
}

public sealed class TokenDefinition
{
    public readonly IMatcher Matcher;
    public readonly object Token;

    public TokenDefinition(string regex, object token)
    {
        this.Matcher = new RegexMatcher(regex);
        this.Token = token;
    }
}

public sealed class Lexer : IDisposable
{
    private readonly TextReader reader;
    private readonly TokenDefinition[] tokenDefinitions;

    private string lineRemaining;

    public Lexer(TextReader reader, TokenDefinition[] tokenDefinitions)
    {
        this.reader = reader;
        this.tokenDefinitions = tokenDefinitions;
        nextLine();
    }

    private void nextLine()
    {
        do
        {
            lineRemaining = reader.ReadLine();
            ++LineNumber;
            Position = 0;
        } while (lineRemaining != null && lineRemaining.Length == 0);
    }

    public bool Next()
    {
        if (lineRemaining == null)
            return false;
        foreach (var def in tokenDefinitions)
        {
            var matched = def.Matcher.Match(lineRemaining);
            if (matched > 0)
            {
                Position += matched;
                Token = def.Token;
                TokenContents = lineRemaining.Substring(0, matched);
                lineRemaining = lineRemaining.Substring(matched);
                if (lineRemaining.Length == 0)
                    nextLine();

                return true;
            }
        }
        throw new Exception(string.Format("Unable to match against any tokens at line {0} position {1} \"{2}\"",
                                          LineNumber, Position, lineRemaining));
    }

    public string TokenContents { get; private set; }

    public object Token { get; private set; }

    public int LineNumber { get; private set; }

    public int Position { get; private set; }

    public void Dispose()
    {
        reader.Dispose();
    }
}

程序范例：

string sample = @"( one (two 456 -43.2 "" \"" quoted"" ))";

var defs = new TokenDefinition[]
{
    // Thanks to [steven levithan][2] for this great quoted string
            // regex
    new TokenDefinition(@"([""'])(?:\\\1|.)*?\1", "QUOTED-STRING"),
    // Thanks to http://www.regular-expressions.info/floatingpoint.html
    new TokenDefinition(@"[-+]?\d*\.\d+([eE][-+]?\d+)?", "FLOAT"),
    new TokenDefinition(@"[-+]?\d+", "INT"),
    new TokenDefinition(@"#t", "TRUE"),
    new TokenDefinition(@"#f", "FALSE"),
    new TokenDefinition(@"[*<>\?\-+/A-Za-z->!]+", "SYMBOL"),
    new TokenDefinition(@"\.", "DOT"),
    new TokenDefinition(@"\(", "LEFT"),
    new TokenDefinition(@"\)", "RIGHT"),
    new TokenDefinition(@"\s", "SPACE")
};

TextReader r = new StringReader(sample);
Lexer l = new Lexer(r, defs);
while (l.Next())
{
    Console.WriteLine("Token: {0} Contents: {1}", l.Token, l.TokenContents);
}

输出：

Token: LEFT Contents: (
Token: SPACE Contents:
Token: SYMBOL Contents: one
Token: SPACE Contents:
Token: LEFT Contents: (
Token: SYMBOL Contents: two
Token: SPACE Contents:
Token: INT Contents: 456
Token: SPACE Contents:
Token: FLOAT Contents: -43.2
Token: SPACE Contents:
Token: QUOTED-STRING Contents: " \" quoted"
Token: SPACE Contents:
Token: RIGHT Contents: )
Token: RIGHT Contents: )

这篇关于穷人的＆QUOT;词法＆QUOT;对于C＃的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

穷人的＆QUOT;词法＆QUOT;对于C＃ [英] Poor man's "lexer" for C#

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录关闭

穷人的＆QUOT;词法＆QUOT;对于C＃ [英] Poor man&#39;s &quot;lexer&quot; for C#

问题描述

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录 关闭

穷人的＆QUOT;词法＆QUOT;对于C＃ [英] Poor man's "lexer" for C#

登录关闭