Improving the Java 8 way of finding the most common words in "War and Peace"
Problem description
I read this problem in Richard Bird's book: Find the top five most common words in War and Peace (or any other text for that matter).
Here is my current attempt:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WarAndPeace {
    public static void main(String[] args) throws Exception {
        // Count occurrences of each lowercased word of at least two characters.
        Map<String, Integer> wc =
            Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
                .map(line -> line.replaceAll("\\p{Punct}", ""))
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> word.matches("\\w+"))
                .map(s -> s.toLowerCase())
                .filter(s -> s.length() >= 2)
                .collect(Collectors.toConcurrentMap(
                    w -> w, w -> 1, Integer::sum));
        // Print the five most frequent words, highest count first.
        wc.entrySet()
            .stream()
            .sorted((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()))
            .limit(5)
            .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
This definitely looks interesting and runs reasonably fast. On my laptop it prints the following:
$> time java -server -Xmx10g -cp target/classes tmp.WarAndPeace
the: 34566
and: 22152
to: 16716
of: 14987
a: 10521
java -server -Xmx10g -cp target/classes tmp.WarAndPeace 1.86s user 0.13s system 274% cpu 0.724 total
It usually runs in under 2 seconds. Can you suggest further improvements to this from an expressiveness and a performance standpoint?
PS: If you are interested in the rich history of this problem, see here.
Recommended answer
You're recompiling all the regexps on every line and on every word. Instead of .flatMap(line -> Arrays.stream(line.split("\\s+"))) write .flatMap(Pattern.compile("\\s+")::splitAsStream). The same goes for .filter(word -> word.matches("\\w+")): use .filter(Pattern.compile("^\\w+$").asPredicate()). The same applies to the map step that calls replaceAll.
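One detail worth spelling out: the predicate returned by asPredicate() searches for the pattern anywhere in the input, whereas String.matches requires the whole string to match. That is why the anchors ^ and $ appear in the replacement. A minimal sketch, where the class name and the sample token are made up for illustration:

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class AsPredicateDemo {
    // Hypothetical names; both patterns are compiled once and reused.
    static final Predicate<String> UNANCHORED = Pattern.compile("\\w+").asPredicate();
    static final Predicate<String> ANCHORED = Pattern.compile("^\\w+$").asPredicate();

    public static void main(String[] args) {
        String token = "well-known"; // the hyphen is not a \w character

        // String.matches must match the entire string.
        System.out.println(token.matches("\\w+"));  // false
        // asPredicate() only needs to find a match somewhere ("well").
        System.out.println(UNANCHORED.test(token)); // true
        // With anchors, the predicate behaves like a whole-string match here.
        System.out.println(ANCHORED.test(token));   // false
    }
}
```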
It's probably better to swap .map(s -> s.toLowerCase()) and .filter(s -> s.length() >= 2) so that toLowerCase() is not called for one-letter words.
You should not use Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum). First, your stream is not parallel, so you may easily replace toConcurrentMap with toMap. Second, it would probably be more efficient (though testing is necessary) to use Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)), as this would reduce boxing (but add a finisher step which will box all the values at once).
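As a rough illustration that the two collectors are interchangeable in terms of results, the sketch below counts the same words both ways and compares the maps. The class and method names are hypothetical, and the sketch says nothing about which variant is faster; that still has to be measured.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class CountingCollectors {
    // Variant 1: toMap, merging boxed Integer counts with Integer::sum.
    static Map<String, Integer> withToMap(List<String> words) {
        return words.stream()
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
    }

    // Variant 2: groupingBy with a summingInt downstream collector.
    static Map<String, Integer> withGroupingBy(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "and", "the", "to", "and", "the");
        // Both variants produce equal maps for the same input.
        System.out.println(withToMap(words).equals(withGroupingBy(words))); // true
    }
}
```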
Instead of (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()) you may use a ready-made comparator: Map.Entry.comparingByValue(Comparator.reverseOrder()), which keeps the descending order (though it's probably a matter of taste).
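A small sketch of the ready-made comparator applied to an in-memory map; the class name, the helper top, and the sample counts are invented for the example:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TopWords {
    // Returns the n keys with the highest counts, highest first.
    static List<String> top(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("to", 1);
        counts.put("the", 3);
        counts.put("and", 2);
        System.out.println(top(counts, 2)); // [the, and]
    }
}
```

Note that comparingByValue() without an argument sorts in ascending order, so the reverseOrder() argument is what preserves the behaviour of the hand-written lambda.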
To sum up:
Map<String, Integer> wc =
    Files.lines(Paths.get("/tmp", "/war-and-peace.txt"))
        .map(Pattern.compile("\\p{Punct}")::matcher)
        .map(matcher -> matcher.replaceAll(""))
        .flatMap(Pattern.compile("\\s+")::splitAsStream)
        .filter(Pattern.compile("^\\w+$").asPredicate())
        .filter(s -> s.length() >= 2)
        .map(s -> s.toLowerCase())
        .collect(Collectors.groupingBy(w -> w,
            Collectors.summingInt(w -> 1)));
wc.entrySet()
    .stream()
    .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
    .limit(5)
    .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
If you don't like method references (some people don't), you may store the precompiled regexps in variables instead.
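One possible shape of that variable-based variant is sketched below. It runs on an in-memory list of lines rather than the file so that it is self-contained; the class name, the constant names, and the count helper are assumptions made for the example, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WordCountWithVariables {
    // Each regexp is compiled exactly once (names are illustrative).
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern WORD = Pattern.compile("^\\w+$");

    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .map(line -> PUNCT.matcher(line).replaceAll(""))
                .flatMap(line -> WHITESPACE.splitAsStream(line))
                .filter(word -> WORD.matcher(word).matches())
                .filter(word -> word.length() >= 2)   // drop one-letter words first
                .map(word -> word.toLowerCase())      // then lowercase the rest
                .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The war, the peace.", "A war of the worlds!");
        Map<String, Integer> wc = count(lines);
        System.out.println(wc.get("the")); // 3
        System.out.println(wc.get("war")); // 2
    }
}
```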