Tokenizing ints vs floats in lex/flex

Views: 48 | Published: 2021/5/29 20:30:51 | Tags: regex, lex


Problem description

I'm teaching myself a little flex/bison for fun. I'm writing an interpreter for the 1975 version of MS Extended BASIC (Extended as in "has strings"). I'm slightly stumped by one issue though.

Floats can be identified by looking for a . or an E (etc.), falling back to an int otherwise. So I did this...

[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
    yylval.d = atof(yytext);
    return FLOAT;
}
[0-9]+ {
    yylval.i = atoi(yytext);
    return INT;
}

The sub-fields in the yylval union are .d for double, .i for int and .s for string.

But it is also possible that you need to use a float because the number is too large to store in an int, which in this case is a 16-bit signed integer.

Is there a way to do this in the regex? Or do I have to do this in the associated C-side code with an if?

Solution

If you want integer to take priority over float (so that a literal which looks like an integer is an integer), then you need to put the integer pattern first. (The pattern with the longest match always wins, but if two patterns both match the same longest prefix, the first one wins.) So your basic outline is:

integer-pattern     { /* integer rule */ }
float-pattern       { /* float rule */ }

Your float rule looks reasonable, but note that it will match a single ., possibly followed by an exponent. Very few languages consider a lone . to be a floating point constant (that literal is conventionally written as 0 :-) ), so you might want to change it to something like

[0-9]*([0-9]\.?|\.[0-9])[0-9]*([Ee][-+]?[0-9]+)?
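To see what the revised float pattern accepts, here is a minimal sketch that tests it outside flex using POSIX regex.h. It assumes the exponent group is meant to be optional (trailing ?), as in the original rule. The helper name matches_float_pattern is hypothetical, and the ^ and $ anchors are added only for this standalone check.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the WHOLE string s matches the float
 * pattern. The exponent group is assumed to be optional. The ^ and $
 * anchors exist only for this standalone test; flex matches at the
 * current input position on its own. */
bool matches_float_pattern(const char *s)
{
    static const char *pattern =
        "^[0-9]*([0-9]\\.?|\\.[0-9])[0-9]*([Ee][-+]?[0-9]+)?$";
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                 /* pattern failed to compile */

    bool ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this pattern, strings such as 1., .5 and 1.5e10 match, while a lone . does not. A plain digit string like 123 also matches, which is why the integer rule has to come first.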

To use a regex to match a non-negative integer which fits into a 16-bit signed int, you can use the following ugly pattern:

0*([12]?[0-9]{1,4}|3(2(7(6[0-7]|[0-5][0-9])|[0-6][0-9]{2})|[0-1][0-9]{3}))

(F)lex will produce efficient code to implement this regex, but that doesn't necessarily make it a good idea.
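The alternative raised in the question, doing the range check on the C side with an if, might look roughly like the sketch below. The names classify_digits, TOK_INT, TOK_FLOAT and union number_value are hypothetical stand-ins: in a real scanner the token codes would come from bison, and the union would be the yylval union with its .i and .d fields.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical token codes; a real parser gets these from bison. */
enum { TOK_INT = 1, TOK_FLOAT = 2 };

/* Hypothetical stand-in for the yylval union described in the question. */
union number_value {
    int    i;
    double d;
};

/* Classify a string of decimal digits (what [0-9]+ would match).
 * Values up to 32767 fit a 16-bit signed int; anything larger is
 * stored as a double instead. */
int classify_digits(const char *text, union number_value *out)
{
    errno = 0;
    long n = strtol(text, NULL, 10);

    if (errno != ERANGE && n <= 32767) {
        out->i = (int)n;
        return TOK_INT;
    }
    out->d = strtod(text, NULL);   /* too large for 16 bits: use a float */
    return TOK_FLOAT;
}
```

With a helper like this, the integer rule can stay as plain [0-9]+, placed before the float rule, and its action returns whatever the helper reports.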

Notes:

The pattern recognises integers with redundant leading zeros, like 09. Some languages (like C) consider that to be an invalid octal literal, but I don't think Basic has that restriction.
The pattern does not recognise 32768, since that's too big to be a positive integer. However, it is not too big to be a negative integer; -32768 would be perfectly fine. This is always a corner case in parsing integer literals. If you were just lexing integer literals, you could easily handle the difference between positive and negative limits by having a separate pattern for literals starting with a -, but including the sign in the integer literal is not appropriate for expression parsers, since it produces an incorrect lexical analysis of a-1. (It would also be a bit weird for -32768 to be a valid integer literal, while - 32768 is analysed as a floating point expression which evaluates to -32768.0.) There's really no good solution here, unless your language includes unsigned integer literals (like C), in which case you could analyse literals from 0 to 32767 as signed integers; from 32768 to 65535 as unsigned integers; and from 65536 and above as floating point.
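The three-way split described in the last note can be sketched in C as follows. This only illustrates the ranges given above for a language that has unsigned literals; the names literal_kind, LIT_INT, LIT_UINT, LIT_FLOAT and classify_literal are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical literal categories for a language with unsigned literals. */
enum literal_kind { LIT_INT, LIT_UINT, LIT_FLOAT };

/* Classify a string of decimal digits by magnitude:
 *   0..32767     -> signed 16-bit integer
 *   32768..65535 -> unsigned 16-bit integer
 *   65536 and up -> floating point */
enum literal_kind classify_literal(const char *digits)
{
    errno = 0;
    unsigned long n = strtoul(digits, NULL, 10);

    if (errno == ERANGE || n > 65535UL)
        return LIT_FLOAT;
    if (n > 32767UL)
        return LIT_UINT;
    return LIT_INT;
}
```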
