Parsing numbers with nom 5.0

Problem description

I'm trying to parse a large file (tens of GB) in a streaming fashion using Nom 5.0. One piece of the parser tries to parse numbers:

use nom::IResult;
use nom::character::streaming::{char, digit1};
// use nom::character::complete::{char, digit1};
use nom::combinator::{map, opt};
use nom::multi::many1;
use nom::sequence::{preceded, tuple};

pub fn number(input: &str) -> IResult<&str, &str> {
    map(
        tuple((
            opt(char('-')),
            many1(digit1),
            opt(preceded(char('.'), many1(digit1)))
        )),
        |_| "0"
    )(input)
}

(Obviously, it should not return "0" for every number; that's just to keep the function as simple as possible.) For this parser, I wrote a test:

#[test]
fn match_positive_integer() {
    let (_, res) = number("0").unwrap();
    assert_eq!("0", res);
}

This test fails with Incomplete(Size(1)) because the "decimals" opt() wants to read data and it isn't there. If I switch to the complete versions of the matchers (as in the commented-out import line), the test passes.

I assume this will actually work in production, because it will be fed additional data when complaining about incompleteness, but I would still like to create unit tests. Additionally, the issue would occur in production if a number happened to be the very last bit of input in a file. How do I convince a streaming Nom parser that there is no more data available?

Recommended answer

One can argue that the test in its original form is correct: the parser can't decide whether the given input is a complete number, so the parse result is in fact still undecided. In production, especially when reading large files as you do, the buffer of already-read-but-not-yet-parsed bytes might end right in the middle of something that could be a number, or that might turn out not to be one. The parser then needs to preserve its current state and ask for more input so it can retry or continue. Think of Incomplete not as a final error but as "I don't know yet: whether this is an error depends on the next byte, so the question can't be decided so far".
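
To make the difference between "undecided" and "failed" concrete, here is a minimal, std-only sketch. It does not use nom; `Outcome`, `number` and the `streaming` flag are names invented for this illustration, and the flag merely imitates the split between nom's streaming and complete matchers.

```rust
/// Hypothetical outcome type for this sketch; it mimics, but is not, nom's `IResult`.
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    /// A number was recognized; `rest` is the unconsumed input.
    Done { matched: &'a str, rest: &'a str },
    /// The input ended before the number could be delimited.
    Incomplete,
    /// The input does not start with a number.
    Error,
}

/// Recognizes `-?[0-9]+(\.[0-9]+)?` at the start of `input`.
/// With `streaming == true`, reaching the end of `input` means "undecided";
/// with `streaming == false`, the end of `input` is treated as the end of the data.
fn number(input: &str, streaming: bool) -> Outcome<'_> {
    let bytes = input.as_bytes();
    // Index just past the run of ASCII digits that begins at `start`.
    let digits_from = |start: usize| {
        start + bytes[start..].iter().take_while(|b| b.is_ascii_digit()).count()
    };
    let int_start = if bytes.first() == Some(&b'-') { 1 } else { 0 };
    let mut end = digits_from(int_start);
    if streaming && end == bytes.len() {
        return Outcome::Incomplete; // more digits could still arrive
    }
    if end == int_start {
        return Outcome::Error; // no digit at all
    }
    if bytes.get(end) == Some(&b'.') {
        let frac_end = digits_from(end + 1);
        if streaming && frac_end == bytes.len() {
            return Outcome::Incomplete; // the fraction could still grow
        }
        if frac_end > end + 1 {
            end = frac_end; // accept the fraction only if it has digits
        }
    }
    Outcome::Done { matched: &input[..end], rest: &input[end..] }
}
```

Under these assumptions, `number("0", true)` is undecided while `number("0", false)` accepts the digit, which matches the behaviour described in the question.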

You can use the complete combinator on your top-level parser, so that when you do in fact reach EOF, the undecided result becomes an error. Incomplete results inside the top-level parser should be handled, e.g. by extending the read buffer by some margin and retrying.
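
The retry loop described above could look roughly like the following sketch. It is std-only and deliberately avoids nom; `Attempt`, `digits_streaming` and `read_first_token` are hypothetical names, and growing the buffer by a fixed `chunk` is just one possible policy.

```rust
use std::io::{self, Read};

/// Hypothetical result of one parse attempt; a stand-in for nom's outcomes.
enum Attempt {
    /// A token was recognized; holds its length in bytes.
    Done(usize),
    /// The buffer ended before the token was delimited.
    Incomplete,
    /// The buffer does not start with a valid token.
    Error,
}

/// Streaming-style recognizer: one or more digits followed by a non-digit.
fn digits_streaming(buf: &[u8]) -> Attempt {
    let n = buf.iter().take_while(|b| b.is_ascii_digit()).count();
    if n == buf.len() {
        Attempt::Incomplete // the digit run might continue in the next chunk
    } else if n == 0 {
        Attempt::Error
    } else {
        Attempt::Done(n)
    }
}

/// Reads `chunk` bytes at a time until the first token is decided.
/// Returns `Ok(None)` if the stream does not start with a token, and an
/// `UnexpectedEof` error if the data ends while the parser is still undecided.
/// `chunk` must be non-zero, otherwise the first read looks like EOF.
fn read_first_token<R: Read>(mut src: R, chunk: usize) -> io::Result<Option<String>> {
    let mut buf = Vec::new();
    let mut scratch = vec![0u8; chunk];
    loop {
        match digits_streaming(&buf) {
            Attempt::Done(n) => {
                return Ok(Some(String::from_utf8_lossy(&buf[..n]).into_owned()));
            }
            Attempt::Error => return Ok(None),
            Attempt::Incomplete => {
                // Undecided: fetch more input and retry on the larger buffer.
                let got = src.read(&mut scratch)?;
                if got == 0 {
                    return Err(io::Error::from(io::ErrorKind::UnexpectedEof));
                }
                buf.extend_from_slice(&scratch[..got]);
            }
        }
    }
}
```

Because the sketch turns "undecided at EOF" into an error, a number that is the very last thing in the stream is rejected. This is the situation raised in the question, and a real parser would need a deliberate rule for it.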

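You can wrap the parser in complete() locally in the unit test and test against that. Note that complete() takes the parser itself rather than its result, and that it reports Incomplete as an error, so the test observes a definite outcome. Something to the tune of:

use nom::combinator::complete;

#[test]
fn incomplete_becomes_error() {
    // `complete` wraps the parser and maps `Incomplete` to `Err::Error`.
    let res = complete(number)("0");
    // With the streaming matchers, "0" at end of input is now a hard error.
    assert!(matches!(res, Err(nom::Err::Error(_))));
}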
