Efficiently Process file comparison using Parallel Stream

Question

So, I have multiple txt files, say txt1, txt2, …, where each line has some text between 4 and 22 characters, and I have another txt file with similar values, say bigTxt. The goal is to check all values in bigTxt that occur somewhere in any of the txt files and output those values (we're guaranteed that if any line of bigTxt is in any of the txt files, a match with that line happens only once). The best solution I have so far works, but is slightly inefficient. Basically, it looks like this:

txtFiles.parallelStream().forEach(file -> {
    List<String> txtList = /* the list of lines of this txt file */;
    streamOfLinesOfBigTxt.forEach(line -> {
        if (txtList.contains(line)) {
            System.out.println(line);
            // it'd be great if we could just stop this forEach here,
            // but that seems hard
        }
    });
});

(Note: I tried breaking out of the forEach using Honza's "bad idea" solution from Break or return from Java 8 stream forEach?, but that must be doing something that's not what I want, because it actually made the code a bit slower or about the same.) The small problem with this is that even after one file has found a match between a line of the bigTxt file and a line of one of the txt files, the other txt files still go on searching for that line (even though we've already found one match and that's sufficient). Something I tried in order to stop this was first iterating over the bigTxt lines (not in parallel, but going through each txt file in parallel) and using Java's anyMatch; I was getting a "stream has already been operated upon or closed" type of error, which I understood later was because anyMatch is a terminal operation. So, after just one call to anyMatch on one of the lines of one of the txt files, that stream was no longer available for later processing. I couldn't think of a way to properly use findAny, and I don't think allMatch is what I want either, since not every value from bigTxt will necessarily be in one of the txt files. Any (parallel) solutions to this (even not strictly using things from Java 8) are welcome. Thank you.
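
To illustrate the error (a minimal sketch of my own with a hypothetical file name, not code from the question): a java.util.stream.Stream allows only a single terminal operation, so a second pass over the same stream fails:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

Stream<String> bigLines = Files.lines(Paths.get("bigTxt.txt")); // hypothetical path
bigLines.anyMatch("foo"::equals); // terminal operation: consumes the stream
bigLines.anyMatch("bar"::equals); // throws IllegalStateException:
                                  // "stream has already been operated upon or closed"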

Answer

If streamOfLinesOfBigTxt is a Stream, you will get the same error with the code posted in your question, as you are trying to process that stream multiple times with your outer stream’s forEach. It’s not clear why you didn’t notice that, but perhaps you always stopped the program before it ever started processing the second file? After all, the time needed for searching the List of lines linearly for every line of the big file scales with the product of both numbers of lines.

When you say you want "to check all values that are in bigTxt that occur somewhere in any of the txt files and output those values", you could do exactly that straightforwardly:

Files.lines(Paths.get(bigFileLocation))
     .filter(line -> txtFiles.stream()
                 .flatMap(path -> {
                     try { return Files.lines(Paths.get(path)); }
                     catch (IOException ex) { throw new UncheckedIOException(ex); }
                 })
                 .anyMatch(Predicate.isEqual(line)))
     .forEach(System.out::println);
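
(For the record, Predicate.isEqual(line) is equivalent to l -> Objects.equals(line, l), and anyMatch stops reading a txt file as soon as it sees the matching line.)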

This does short-circuiting, but still has the problem of a processing time that scales with n×m. Even worse, it will re-open and read the txt files repeatedly.

If you want to avoid that, storing the data in RAM is unavoidable. If you store it anyway, you can choose a storage that supports better-than-linear lookup in the first place:

Set<String> matchLines = txtFiles.stream()
    .flatMap(path -> {
        try { return Files.lines(Paths.get(path)); }
        catch (IOException ex) { throw new UncheckedIOException(ex); }
    })
    .collect(Collectors.toSet());

Files.lines(Paths.get(bigFileLocation))
     .filter(matchLines::contains)
     .forEach(System.out::println);

Now, the execution time of this scales with the sum of the numbers of lines of all files rather than the product. But it needs temporary storage for all distinct lines of the txtFiles.
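
As a rough ballpark (my estimate, not from the original answer): on a 64-bit JVM, a 4–22 character line held in a HashSet costs on the order of 80–120 bytes including String and hash-table overhead, so ten million distinct lines would occupy roughly 1 GB of heap.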

If the big file has fewer distinct lines than the other files together and the order doesn't matter, you can store the lines of the big file in a set instead and check the lines of the txtFiles on the fly.

Set<String> matchLines
    = Files.lines(Paths.get(bigFileLocation)).collect(Collectors.toSet());

txtFiles.stream()
        .flatMap(path -> {
            try { return Files.lines(Paths.get(path)); }
            catch (IOException ex) { throw new UncheckedIOException(ex); }
        })
        .filter(matchLines::contains)
        .forEach(System.out::println);

This relies on the property that all matching lines are unique across all these text files, as you have stated in your question.
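
If that guarantee ever goes away, a small variation (my suggestion, not part of the original answer, assuming the same imports and variables as above) prints each matching line at most once by removing it from the set on its first hit, since Set.remove returns true only the first time:

// Collectors.toSet() makes no mutability guarantee, so collect into an explicit HashSet
Set<String> matchLines = Files.lines(Paths.get(bigFileLocation))
    .collect(Collectors.toCollection(HashSet::new));

txtFiles.stream()
        .flatMap(path -> {
            try { return Files.lines(Paths.get(path)); }
            catch (IOException ex) { throw new UncheckedIOException(ex); }
        })
        .filter(matchLines::remove)   // true only for the first occurrence of each line
        .forEach(System.out::println);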

I don’t think, that there will be any benefit from parallel processing here, as the I/O speed will dominate the execution.
