Improving the Java 8 way of finding the most common words in "War and Peace"

Problem description

I read this problem in Richard Bird's book: Find the top five most common words in War and Peace (or any other text for that matter).

This is my current attempt:

package tmp;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WarAndPeace {
    public static void main(String[] args) throws Exception {
        // Count occurrences of each lower-cased word of at least two characters.
        Map<String, Integer> wc =
            Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
            .map(line -> line.replaceAll("\\p{Punct}", ""))
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .filter(word -> word.matches("\\w+"))
            .map(s -> s.toLowerCase())
            .filter(s -> s.length() >= 2)
            .collect(Collectors.toConcurrentMap(
                    w -> w, w -> 1, Integer::sum));

        // Print the five most frequent words, highest count first.
        wc.entrySet()
            .stream()
            .sorted((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()))
            .limit(5)
            .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));

    }
}

This definitely looks interesting and runs reasonably fast. On my laptop it prints the following:

$> time java -server -Xmx10g -cp target/classes tmp.WarAndPeace
the: 34566
and: 22152
to: 16716
of: 14987
a: 10521
java -server -Xmx10g -cp target/classes tmp.WarAndPeace  1.86s user 0.13s system 274% cpu 0.724 total

It usually runs in under 2 seconds. Can you suggest further improvements to this from an expressiveness and a performance standpoint?

PS: If you are interested in the rich history of this problem, see here.

Recommended answer

You're recompiling all the regexps on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+"))) write .flatMap(Pattern.compile("\\s+")::splitAsStream). The same goes for .filter(word -> word.matches("\\w+")): use .filter(Pattern.compile("^\\w+$").asPredicate()). The same applies to the map step that calls replaceAll.
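As a rough illustration of the point above, the sketch below puts the lambda form and the method-reference form side by side. The class and method names (CompileOnceDemo, tokensOld, tokens) are made up for this example. In the method-reference form the receiver expression Pattern.compile(...) is evaluated once, when the reference is created, so the pattern is not rebuilt per element.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CompileOnceDemo {
    // Lambda form from the question: split() and matches() handle
    // their regex afresh for every line and every word.
    public static List<String> tokensOld(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> word.matches("\\w+"))
                .collect(Collectors.toList());
    }

    // Method-reference form: each Pattern.compile call runs once,
    // while the pipeline is being assembled.
    public static List<String> tokens(List<String> lines) {
        return lines.stream()
                .flatMap(Pattern.compile("\\s+")::splitAsStream)
                .filter(Pattern.compile("^\\w+$").asPredicate())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("war and peace", "it's 1812");
        System.out.println(tokensOld(lines));
        System.out.println(tokens(lines));
    }
}
```

Both methods should yield the same tokens for ordinary lines of text; whether the second is measurably faster on a given machine still needs to be benchmarked.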

It's probably better to swap .map(s -> s.toLowerCase()) and .filter(s -> s.length() >= 2) so that toLowerCase() is not called for one-letter words.

You should not use Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum). First, your stream is not parallel, so you may easily replace toConcurrentMap with toMap. Second, it would probably be more efficient (though testing is necessary) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)) as this would reduce boxing (but add a finisher step which will box all the values at once).
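To make the two collector choices concrete, here is a minimal sketch that counts the same words both ways. The class and method names (CountingCollectors, withToMap, withGroupingBy) are hypothetical and exist only for this example; it shows that the results agree, not which one is faster.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CountingCollectors {
    // Sequential replacement for toConcurrentMap: merges boxed Integers.
    public static Map<String, Integer> withToMap(Stream<String> words) {
        return words.collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    // Groups by word and sums 1 per occurrence in the downstream collector.
    public static Map<String, Integer> withGroupingBy(Stream<String> words) {
        return words.collect(
                Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        System.out.println(withToMap(Stream.of("the", "and", "the")));
        System.out.println(withGroupingBy(Stream.of("the", "and", "the")));
    }
}
```

Both maps hold the same counts, so the choice between them comes down to measured performance and readability.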

Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()) you may use a ready-made comparator: Map.Entry.comparingByValue(Comparator.reverseOrder()), which keeps the descending order (though it's probably a matter of taste).
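A small sketch of the ready-made comparator in use, assuming a word-count map like the one built earlier. The names TopEntries and topKeys are invented for this example. Comparator.reverseOrder() is passed in so that the largest counts come first, matching the hand-written lambda.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopEntries {
    // Returns the keys of the n entries with the highest values.
    public static List<String> topKeys(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("and", 2);
        counts.put("to", 1);
        System.out.println(topKeys(counts, 2));
    }
}
```

Entries with equal counts may appear in either order, since the map's iteration order is unspecified.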

To summarize:

Map<String, Integer> wc =
    Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
        .map(Pattern.compile("\\p{Punct}")::matcher)
        .map(matcher -> matcher.replaceAll(""))
        .flatMap(Pattern.compile("\\s+")::splitAsStream)
        .filter(Pattern.compile("^\\w+$").asPredicate())
        .filter(s -> s.length() >= 2)
        .map(s -> s.toLowerCase())
        .collect(Collectors.groupingBy(w -> w,
                Collectors.summingInt(w -> 1)));

wc.entrySet()
    .stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .limit(5)
    .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));

If you don't like method references (some people don't), you may store the precompiled regexps in variables instead.
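One possible shape for the variable-based variant is sketched below. The class name WordCountWithVariables, the method count, and the local variable names are assumptions made for this example, and the input is an in-memory stream of lines rather than the file, so the sketch runs on its own.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountWithVariables {
    public static Map<String, Integer> count(Stream<String> lines) {
        // Each pattern is compiled once and captured by the lambdas below.
        Pattern punct = Pattern.compile("\\p{Punct}");
        Pattern whitespace = Pattern.compile("\\s+");
        Pattern word = Pattern.compile("\\w+");

        return lines
                .map(line -> punct.matcher(line).replaceAll(""))
                .flatMap(line -> whitespace.splitAsStream(line))
                .filter(w -> word.matcher(w).matches())
                .filter(w -> w.length() >= 2)
                .map(w -> w.toLowerCase())
                .collect(Collectors.groupingBy(w -> w,
                        Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        System.out.println(count(Stream.of("The war, the peace.", "A war!")));
    }
}
```

Because matcher(w).matches() requires the whole word to match, the ^ and $ anchors are not needed in this form.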
