How to tokenize a String like a lexer in Java?


Question

These are the lines of code that I want to tokenize according to lexer rules:

String input1 = input.replaceAll("\\s+"," ");
List<String> uncleanList = Arrays.asList(input1.split(" "));

I put this code in a String and replaced every run of whitespace with a single space:

String s = codeString.replaceAll("\\s+"," ");

Then I called the

String[] t = s.split(" ");

method on that string, which gave me an array split on single spaces. I got this result (console output of System.out.println(Arrays.toString(t));):

[String, input1, =, input.replaceAll("\\s+",", ");, List<String>, uncleanList, =, Arrays.asList(input1.split(", "));]

But there are a lot of ( ) brackets, < >, dots, quotes and so on that are not separated by spaces. Now I am stuck here. How can I separate the symbols from the letters and digits so that each one ends up at its own index?

Desired output when the array is printed to the console:

 [String, input1, =, input,.,replaceAll,(,"\\s+"," ",),;, List,<,String,>, uncleanList, =, Arrays,.,asList,(,input1,.,split,(," ",),),;,]   


Recommended answer

When you don't have a delimiter to use, split stops being an effective way to tokenize. Instead of using split to find the parts you don't want, use find to find the parts you do want, like this:

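import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Alternatives: a word, a number-like run, or any single non-whitespace character
Pattern pattern = Pattern.compile("\\w+|[+-]?[0-9\\._Ee]+|\\S");
Matcher matcher = pattern.matcher(input);

// Find all matches; each call to find() advances to the next token
while (matcher.find()) {
  String token = matcher.group();
  // use or collect the token here
}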

The example regex I provide here is simpler than what you really want. The important thing is that you provide a default pattern (\S) to match any non-whitespace character that isn't included in a longer match. Because the alternatives are tried from left to right, this default goes last. That will take care of all the single-character tokens.
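As a minimal sketch of how the loop above might be used end to end, the following wraps the same pattern in a small class that collects the tokens into a list. The class name SimpleTokenizer, the tokenize method, and the sample input are assumptions made for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the pattern itself comes from the answer.
class SimpleTokenizer {
    // A word, a number-like run, or any single non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("\\w+|[+-]?[0-9\\._Ee]+|\\S");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        // find() skips anything no alternative matches, which here is whitespace.
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        System.out.println(tokenize("List<String> xs = Arrays.asList(input1.split(sep));"));
    }
}
```

One limitation of this simple pattern is worth knowing about: the number alternative also accepts ".", "_", "E" and "e", so a dot followed by a name starting with "e" (for example a.equals(b)) is split as ".e" and "quals". String literals are also broken into single characters. Both are cases where a fuller pattern is needed.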

Some of the longer tokens you have to match, like strings and comments, are pretty complicated, so it will take some work to get this right.
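To give a rough idea of what that extra work could look like, here is a sketch that puts alternatives for comments, string literals, char literals and numbers ahead of the identifier and fallback patterns. The class name LexerSketch and every sub-pattern below are assumptions for illustration. They cover only common cases and are not a complete Java lexer (for instance, hex literals, text blocks and unterminated strings are not handled specially).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: longer token kinds are listed before the \S fallback.
class LexerSketch {
    private static final Pattern TOKEN = Pattern.compile(
            "//[^\\n]*"                                    // line comment
            + "|/\\*[\\s\\S]*?\\*/"                        // block comment
            + "|\"(?:\\\\.|[^\"\\\\])*\""                  // string literal with escapes
            + "|'(?:\\\\.|[^'\\\\])*'"                     // char literal with escapes
            + "|[0-9]+(?:\\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"  // simple decimal number
            + "|[A-Za-z_$][A-Za-z0-9_$]*"                  // identifier or keyword
            + "|\\S");                                     // fallback: any other symbol

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(source);
        while (matcher.find()) {
            tokens.add(matcher.group()); // comments are kept; callers may filter them
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "String s = input.replaceAll(\"\\\\s+\", \" \"); // done";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

This sketch is meant to run on the original source text. If whitespace is first collapsed with replaceAll("\\s+", " "), spaces inside string literals are altered and a line comment loses its terminating newline, so it would swallow the rest of the input.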
