Best way to parse Space Separated Text

Question

I have a string like this:

 /c SomeText\MoreText "Some Text\More Text\Lol" SomeText

I want to tokenize it, but I can't just split on the spaces. I've come up with a somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.

This is in C# btw.

EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.

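// Requires: using System.Collections.Generic;
private string[] tokenize(string input)
{
    string[] tokens = input.Split(' ');
    List<string> output = new List<string>();

    for (int i = 0; i < tokens.Length; i++)
    {
        if (tokens[i].StartsWith("\""))
        {
            // Rejoin the space-split pieces of a quoted token.
            string temp = tokens[i];
            int k = i;
            // No scan is needed when the first piece already closes the quote, e.g. "word".
            bool closed = temp.Length > 1 && temp.EndsWith("\"");
            while (!closed && ++k < tokens.Length)
            {
                temp += " " + tokens[k];
                closed = tokens[k].EndsWith("\"");
            }
            output.Add(temp);
            // Resume at the last piece consumed; the outer loop's i++ steps past it.
            i = k;
        }
        else
        {
            output.Add(tokens[i]);
        }
    }

    return output.ToArray();
}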

Solution

The computer science term for what you're doing is lexical analysis; reading up on it will give you a good summary of this common task.

Based on your example, I'm guessing that you want whitespace to separate your words, but stuff in quotation marks should be treated as a "word" without the quotes.
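Before reaching for a regex, it may help to see that rule written out as a hand-rolled, single-pass scanner. The following is only a sketch under the assumptions stated above (whitespace separates words, double quotes group text, and the quotes themselves are dropped); the names `QuoteAwareScanner` and `Scan` are invented for illustration and do not come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareScanner
{
    // Splits on whitespace, except inside double quotes.
    // The quote characters themselves are not part of the returned tokens.
    public static List<string> Scan(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        bool hasToken = false; // lets an empty quoted token ("") still count as a token

        foreach (char c in input)
        {
            if (c == '"')
            {
                // Toggle quoted mode; a quote always starts or continues a token.
                inQuotes = !inQuotes;
                hasToken = true;
            }
            else if (char.IsWhiteSpace(c) && !inQuotes)
            {
                // Unquoted whitespace ends the current token, if there is one.
                if (hasToken)
                {
                    tokens.Add(current.ToString());
                    current.Clear();
                    hasToken = false;
                }
            }
            else
            {
                current.Append(c);
                hasToken = true;
            }
        }

        // Flush the last token; this also keeps the text of an unterminated quote.
        if (hasToken)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

Called on the string from the question, `Scan` yields four tokens: `/c`, `SomeText\MoreText`, `Some Text\More Text\Lol` and `SomeText`. One design choice to be aware of: a quote in the middle of a word (as in `ab"c d"e`) does not start a new token, so that input comes back as the single token `abc de`.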

The simplest way to do this is to define a word as a regular expression:

([^"\s]+)\s*|"([^"]+)"\s*

This expression states that a "word" is either (1) a run of non-quote, non-whitespace characters followed by optional whitespace, or (2) non-quote text enclosed in double quotes, again followed by optional whitespace. Note the use of capturing parentheses to pick out the desired text: group 1 holds an unquoted word, and group 2 holds the contents of a quoted one.

Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".

Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data since there are two sets of capturing parentheses.

' Requires: Imports System.Text.RegularExpressions
Dim token As String
Dim r As Regex = New Regex("([^""\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")

While m.Success
    ' Group 1 is empty when the match was a quoted word; fall back to group 2.
    token = m.Groups(1).Value
    If token.Length = 0 Then
        token = m.Groups(2).Value
    End If
    ' token now holds the next word; use it here.
    m = m.NextMatch()
End While
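Since the question asks about C#, the same loop might look roughly like this there. Treat it as a sketch, not the answerer's code: the names `RegexTokenizer` and `Tokenize` are made up for illustration, and the first character class is written as `[^"\s]` so that it excludes only quotes and whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenizer
{
    // Group 1: an unquoted word. Group 2: the text inside a pair of double quotes.
    // In a verbatim string, "" stands for one literal double quote.
    private static readonly Regex Word =
        new Regex(@"([^""\s]+)\s*|""([^""]+)""\s*");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Word.Matches(input))
        {
            // Only one of the two alternatives matches, so only one group succeeds.
            tokens.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return tokens;
    }
}
```

For the test string used in the VB.NET code, `Tokenize` returns `this`, `is`, `a` and `test string`. Because the quoted alternative requires at least one character between the quotes, an empty pair `""` produces no token with this pattern.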

Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scenes a little better :)
