Find a string in a very large formatted text file in Java

Question

Here is the thing: I have a really big text file and it has a format like this:

0007476|000011434982|00249626000|R|2008-01-11 00:00:00|9999-12-31 23:59:59|000019.99
0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99
...
...

And I need to find if a particular pattern exists in the file, say

0007476|whatever|00313865000|whatever

All I need is a boolean saying yes or no. Now what I have done is to read the file line by line and do a regular expression matching:

Pattern pattern = Pattern.compile(regex);
Scanner scanner = new Scanner(new File(fileName));
String line;
while (scanner.hasNextLine()) {
    line = scanner.nextLine();
    if (pattern.matcher(line).matches()) {
        scanner.close();
        return true;
    }
}

The regular expression has the form

"0007476\\|\\d{12}\\|0031386500.*"

This method works, but it usually takes 15 seconds to find a string that is far from the start of the file. Is there a faster way to achieve that? Thanks
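
For reference, the pattern from the question can be checked against the two sample lines. This is only a small sketch: the class name RecordPatternDemo and the method isWanted are made up for illustration and are not part of the original code. It shows that matches() has to consume the whole line, which is why the pattern ends in ".*".

```java
import java.util.regex.Pattern;

final class RecordPatternDemo {
    // Pattern from the question; backslashes are doubled because this is a Java string literal.
    static final Pattern RECORD =
            Pattern.compile("0007476\\|\\d{12}\\|0031386500.*");

    // Hypothetical helper: matches() succeeds only if the entire line fits the pattern.
    static boolean isWanted(String line) {
        return RECORD.matcher(line).matches();
    }

    public static void main(String[] args) {
        String first = "0007476|000011434982|00249626000|R|2008-01-11 00:00:00|9999-12-31 23:59:59|000019.99";
        String second = "0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99";
        System.out.println(isWanted(first));  // false: the third field does not start with 0031386500
        System.out.println(isWanted(second)); // true
    }
}
```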

Recommended answer

I assume that you need the Scanner because the file is too big to read into a single String?

If that is not the case, you can probably use a regular expression that finds the match directly. Depending on whether or not you care about the specific text at the start of the line, you can use something along the lines of:

(?m)^0007476\|\d{12}\|0031386500.*$
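
A minimal sketch of that single-String approach, assuming the file fits comfortably in memory (a Java String cannot hold more than about 2^31 characters) and is UTF-8 encoded. The class name WholeFileSearch and the method containsRecord are illustrative names, not something from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.regex.Pattern;

final class WholeFileSearch {
    // (?m) lets ^ and $ match at every line boundary, so find() can scan the whole text at once.
    static final Pattern RECORD =
            Pattern.compile("(?m)^0007476\\|\\d{12}\\|0031386500.*$");

    // Hypothetical helper: true if any line of the text matches the pattern.
    static boolean containsRecord(CharSequence text) {
        return RECORD.matcher(text).find();
    }

    // Reads the whole file into one String (assumption: UTF-8, small enough for the heap).
    static boolean containsRecord(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return containsRecord(content);
    }

    public static void main(String[] args) throws IOException {
        // Write the two sample lines to a temporary file and search it.
        Path file = Files.createTempFile("records", ".txt");
        try {
            Files.write(file, Arrays.asList(
                    "0007476|000011434982|00249626000|R|2008-01-11 00:00:00|9999-12-31 23:59:59|000019.99",
                    "0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99"),
                    StandardCharsets.UTF_8);
            System.out.println(containsRecord(file)); // true
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Note the switch from matches() to find(): with the whole file in one String, the pattern must locate one matching line somewhere inside the text and not match the text in its entirety.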

If you do need to break it up into smaller chunks because of memory usage, I would suggest not reading line by line (since the lines are rather short) but processing bigger chunks using something like a BufferedReader instead.

I fiddled around a bit with a 1.25GB file and the following is about 2.5 times faster than your implementation:

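import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

private static boolean matches() throws IOException {
   // (?m) makes ^ and $ match at line boundaries inside each chunk.
   String regex = "(?m)^0007476\\|\\d{12}\\|0031386500.*$";
   Pattern pattern = Pattern.compile(regex);

   try (BufferedReader br = new BufferedReader(new FileReader(FILENAME))) {
      // Search 10000 lines at a time instead of running the matcher once per line.
      for (String lines; (lines = readLines(br, 10000)) != null; ) {
         if (pattern.matcher(lines).find()) {
            return true;
         }
      }
   }

   return false;
}

// Reads up to `amount` lines into one String; returns null once the reader is exhausted.
private static String readLines(BufferedReader br, int amount) throws IOException {
   StringBuilder builder = new StringBuilder();
   int lineCounter = 0;
   // Check the counter before calling readLine() so that no line is read and then dropped.
   for (String line; lineCounter < amount && (line = br.readLine()) != null; lineCounter++) {
      builder.append(line).append(System.lineSeparator());
   }

   return lineCounter > 0 ? builder.toString() : null;
}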
