Java java.util.regex.MatchResult counter problems with Scanner

Problem description

I'm using a java.util.Scanner to find all occurrences of a given regex in a big string.

Scanner sc = new Scanner(body);
sc.useDelimiter("");
String match = "";
while (match != null)
{
    // A horizon of 0 means the search is not bounded.
    match = sc.findWithinHorizon(pattern, 0);
    if (match == null) break;
    MatchResult mr = sc.match();
    System.out.println("Match string: " + mr.group());
    System.out.println("Match string using indexes: " + body.substring(mr.start(), mr.end()));
}

The strange thing is that after a certain number of scans, the group() method still returns the correct occurrence, while the start() and end() methods return wrong indexes, as if the scan had restarted from the beginning of the file. The regex is multiline (I use this regex to detect a line change: "\r\n|[\n\r\u2028\u2029\u0085]").

Do you have any hints? Could it be related to the "horizon" parameter (I've tried different combinations for that value)?

For more details: it seems related to the size of the file (more than 1000 chars). After about 1000 characters the counter restarts from 0 (e.g. the first wrong index occurrence after 1003:1020 becomes 3:120).

Recommended answer

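Scanner uses an internal buffer of 1024 characters, so the indexes reported by start() and end() appear to be relative to that buffer rather than to the original string. That would explain why they wrap around after roughly 1000 characters. Use Pattern and Matcher directly instead: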

Matcher matcher = Pattern.compile(...).matcher(body);
while(matcher.find()) {
    int start = matcher.start();
}
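
To make the answer concrete, here is a minimal, self-contained sketch of the Matcher approach applied to the line-break regex quoted in the question. The class name LineBreakOffsets, the helper findOffsets, and the generated sample text are assumptions for illustration only; they are not part of the original question or answer. Because Matcher works on the whole CharSequence, start() and end() should always be offsets into body, even when body is longer than 1024 characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: collects match offsets with Matcher instead of Scanner.
public class LineBreakOffsets {

    // The line-break regex from the question, with escapes left to the regex engine.
    static final Pattern LINE_BREAK =
            Pattern.compile("\\r\\n|[\\n\\r\\u2028\\u2029\\u0085]");

    // Returns one {start, end} pair per match; offsets are relative to body.
    static List<int[]> findOffsets(String body) {
        List<int[]> offsets = new ArrayList<>();
        Matcher matcher = LINE_BREAK.matcher(body);
        while (matcher.find()) {
            offsets.add(new int[] { matcher.start(), matcher.end() });
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Build sample input that is well over 1024 characters long.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("line ").append(i).append("\r\n");
        }
        String body = sb.toString();

        // Every reported span should cut exactly one line break out of body.
        for (int[] span : findOffsets(body)) {
            String viaIndexes = body.substring(span[0], span[1]);
            if (!viaIndexes.equals("\r\n")) {
                throw new AssertionError("Unexpected text at " + span[0] + ":" + span[1]);
            }
        }
        System.out.println("Input length: " + body.length());
        System.out.println("Matches found: " + findOffsets(body).size());
    }
}
```

If the matched text itself is also needed, matcher.group() can be read inside the same loop, alongside start() and end().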
