Improving the Java 8 way of finding the most common words in "War and Peace"

Views: 154 | Published: 2019/1/14 11:58:16 | Tags: java-8, java-stream

Question

I read this problem in Richard Bird's book: find the top five most common words in War and Peace (or any other text, for that matter).

Here is my current attempt:

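import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WarAndPeace {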
    public static void main(String[] args) throws Exception {
        Map<String, Integer> wc =
            Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
            .map(line -> line.replaceAll("\\p{Punct}", ""))
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .filter(word -> word.matches("\\w+"))
            .map(s -> s.toLowerCase())
            .filter(s -> s.length() >= 2)
            .collect(Collectors.toConcurrentMap(
                    w -> w, w -> 1, Integer::sum));

        wc.entrySet()
            .stream()
            .sorted((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()))
            .limit(5)
            .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));

    }
}

This definitely looks interesting and runs reasonably fast. On my laptop it prints the following:

$> time java -server -Xmx10g -cp target/classes tmp.WarAndPeace
the: 34566
and: 22152
to: 16716
of: 14987
a: 10521
java -server -Xmx10g -cp target/classes tmp.WarAndPeace  1.86s user 0.13s system 274% cpu 0.724 total

It usually runs in under 2 seconds. Can you suggest further improvements, in terms of both expressiveness and performance?

PS: If you are interested in the rich history of this problem, see here.

Recommended answer

You're recompiling all the regexps on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+"))) write .flatMap(Pattern.compile("\\s+")::splitAsStream). Likewise, replace .filter(word -> word.matches("\\w+")) with .filter(Pattern.compile("^\\w+$").asPredicate()); the anchors are needed because asPredicate() tests with find() rather than matching the whole string. The same goes for the replaceAll call in the first map step.

It's probably better to swap .map(s -> s.toLowerCase()) and .filter(s -> s.length() >= 2), so that toLowerCase() is not called for one-letter words.

You should not use Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum). First, your stream is not parallel, so you can simply replace toConcurrentMap with toMap. Second, it would probably be more efficient (though this needs testing) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)), as this reduces boxing (but adds a finisher step that boxes all the values at once).
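The two sequential collectors mentioned above can be compared side by side. The following is only a minimal sketch: the class and method names (CountingCollectors, countWithToMap, countWithGroupingBy) are made up for illustration, and it says nothing about which variant is actually faster on a given JVM.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of the original question or answer.
public class CountingCollectors {

    // toMap: every merge goes through Integer::sum on boxed Integer values.
    static Map<String, Integer> countWithToMap(Stream<String> words) {
        return words.collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    // groupingBy + summingInt: the downstream collector sums primitive ints
    // per key and only boxes the totals in its finisher.
    static Map<String, Integer> countWithGroupingBy(Stream<String> words) {
        return words.collect(
                Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        System.out.println(countWithToMap(Stream.of("the", "and", "the")));
        System.out.println(countWithGroupingBy(Stream.of("the", "and", "the")));
    }
}
```

Both methods return the same counts for the same input, so the choice between them is a matter of measured performance and readability.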

Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()) you can use a ready-made comparator: Map.Entry.comparingByValue(Comparator.reverseOrder()), which keeps the descending order (though this is probably a matter of taste).
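As a small illustration of the ready-made comparator, here is a sketch that picks the highest-count keys from a map. The class name TopEntries and the method topKeys are hypothetical, and the sample counts are made-up values, not results from the book.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
public class TopEntries {

    // Returns the n keys with the highest counts, highest count first.
    static List<String> topKeys(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                // Reverse the natural Integer order to sort by value, descending.
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3); // made-up sample counts
        counts.put("and", 2);
        counts.put("to", 1);
        System.out.println(topKeys(counts, 2));
    }
}
```

Entries with equal counts have no guaranteed relative order here, because the comparator looks at values only.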

To sum up:

Map<String, Integer> wc =
    Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
        .map(Pattern.compile("\\p{Punct}")::matcher)
        .map(matcher -> matcher.replaceAll(""))
        .flatMap(Pattern.compile("\\s+")::splitAsStream)
        .filter(Pattern.compile("^\\w+$").asPredicate())
        .filter(s -> s.length() >= 2)
        .map(s -> s.toLowerCase())
        .collect(Collectors.groupingBy(w -> w,
                Collectors.summingInt(w -> 1)));

wc.entrySet()
    .stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .limit(5)
    .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));

If you don't like method references (some people don't), you can store the precompiled regexps in variables instead.
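A variant with the patterns stored in variables might look like the sketch below. The class name, the constant names, the countWords helper, and the command-line handling are all assumptions added for illustration; when no file path is given it falls back to two made-up sample lines so that it can run anywhere.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class name; the pipeline mirrors the summary in the answer.
public class WarAndPeaceWithVariables {

    // Each pattern is compiled once and reused for every line and word.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern WORD = Pattern.compile("^\\w+$");

    static Map<String, Integer> countWords(Stream<String> lines) {
        return lines
                .map(line -> PUNCT.matcher(line).replaceAll(""))
                .flatMap(line -> WHITESPACE.splitAsStream(line))
                .filter(word -> WORD.matcher(word).find())
                .filter(word -> word.length() >= 2)
                .map(word -> word.toLowerCase())
                .collect(Collectors.groupingBy(w -> w,
                        Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: an optional first argument names the text file;
        // otherwise two sample lines are used.
        Stream<String> source = args.length > 0
                ? Files.lines(Paths.get(args[0]))
                : Stream.of("Well, Prince, so Genoa and Lucca", "are now just family estates");
        // try-with-resources closes the file handle behind Files.lines.
        try (Stream<String> lines = source) {
            countWords(lines).entrySet().stream()
                    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                    .limit(5)
                    .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
        }
    }
}
```

Keeping the counting logic in a method that takes a Stream<String> makes it easy to try on a few in-memory lines before pointing it at the full text.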
