Best way to keep track of maximum 5 values found while parsing a stream in Java


Question

I'm parsing a large file line by line, reading substrings from each line. I obtain an integer value from each substring, about 30 per line, and need to return the 5 highest values from the file. What data structure is most efficient for keeping track of the 5 largest values while going through the file?

Recommended answer

This problem is usually solved with a heap, but (perhaps counter-intuitively) you use a min-heap (the smallest element is the "top" of the heap).

The algorithm is basically this:


   for each item parsed
      if the heap contains fewer than n items,
         add the new item to the heap
      else
         if the new item is "greater" than the "smallest" item in the heap
            remove the smallest item and replace it with the new item

When you are done, you can pop the elements off the heap from least to greatest.

Or, concretely:

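  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.PriorityQueue;

  /** Returns the k largest items, greatest first. */
  static <T extends Comparable<T>> List<T> top(Iterable<? extends T> items, int k) {
    if (k < 0) throw new IllegalArgumentException("k must be non-negative");
    if (k == 0) return Collections.emptyList();
    // PriorityQueue is a min-heap: peek() is the smallest of the items kept so far.
    PriorityQueue<T> top = new PriorityQueue<>(k);
    for (T item : items) {
      if (top.size() < k) top.add(item);
      else if (item.compareTo(top.peek()) > 0) {
        // The new item beats the current smallest, so evict that one.
        top.remove();
        top.add(item);
      }
    }
    // Drain smallest-first, then reverse so the largest comes first.
    List<T> hits = new ArrayList<>(top.size());
    while (!top.isEmpty())
      hits.add(top.remove());
    Collections.reverse(hits);
    return hits;
  }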
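To tie this back to the question, here is a minimal sketch of the same min-heap idea applied directly to lines of text. The class name `TopValuesFromLines` and the method `topK` are hypothetical, and the sketch assumes each line holds whitespace-separated integers; the question does not say how the substrings are actually extracted, so that part would need adapting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical example class; not part of the original answer.
final class TopValuesFromLines {

    // Keeps the k largest ints seen across all lines, returned largest first.
    // Assumption: every line contains whitespace-separated integers.
    static List<Integer> topK(Iterable<String> lines, int k) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        PriorityQueue<Integer> heap = new PriorityQueue<>(k); // min-heap
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) continue; // blank line yields one empty token
                int value = Integer.parseInt(token);
                if (heap.size() < k) {
                    heap.add(value);
                } else if (value > heap.peek()) {
                    heap.poll();      // drop the smallest kept value
                    heap.add(value);
                }
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("3 17 8", "42 5 23", "", "99 1 16 4");
        System.out.println(topK(lines, 5)); // prints [99, 42, 23, 17, 16]
    }
}
```

Only k values are held in memory at any time, so the size of the file does not matter. For a real file, the lines could come from something like a `BufferedReader` instead of the in-memory list used here.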

You can compare the new item to the top of the heap efficiently, and you don't need to keep all of the elements strictly ordered all the time, so this is faster than a completely ordered collection like a TreeSet.

For a very short list of five elements, iterating over an array may be faster. But if the size of the "top hits" collection grows, this heap-based method will win out.
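For comparison, the array-based alternative might look roughly like the sketch below. The class `TopFiveArray` and its methods are hypothetical names, and this is only one possible layout: a small `int[]` kept in ascending order, so the smallest kept value is always at index 0 and a new value is placed by shifting at most a handful of elements. Whether it actually beats the heap for five elements would need measuring.

```java
import java.util.Arrays;

// Hypothetical example class; not part of the original answer.
final class TopFiveArray {
    private final int[] top; // filled prefix is sorted ascending; top[0] is the smallest
    private int count = 0;

    TopFiveArray(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        top = new int[capacity];
    }

    void offer(int value) {
        if (count < top.length) {
            // Still filling up: insert into the sorted prefix.
            int i = count++;
            while (i > 0 && top[i - 1] > value) {
                top[i] = top[i - 1];
                i--;
            }
            top[i] = value;
        } else if (value > top[0]) {
            // Full: overwrite the smallest and shift smaller values down.
            int i = 0;
            while (i + 1 < top.length && top[i + 1] < value) {
                top[i] = top[i + 1];
                i++;
            }
            top[i] = value;
        }
    }

    // Returns the kept values, largest first.
    int[] largestFirst() {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) out[i] = top[count - 1 - i];
        return out;
    }

    public static void main(String[] args) {
        TopFiveArray tracker = new TopFiveArray(5);
        for (int v : new int[] {3, 17, 8, 42, 5, 23, 99, 1, 16, 4}) tracker.offer(v);
        System.out.println(Arrays.toString(tracker.largestFirst())); // prints [99, 42, 23, 17, 16]
    }
}
```

Most values fail the single `value > top[0]` comparison and are rejected immediately, and the sketch works on primitive `int`s with no boxing. Each accepted value costs time linear in the capacity, though, which is why the heap is the better fit as the number of kept values grows.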
