Java algorithm to track parts of aggregated values


Question

My program has to evaluate hundreds of millions of records, so memory and performance both matter. Each record has a key, ticketID, plus a field value and a field source_name. In the source data, each ticketID has from 1 to many (nearly 100) source_name values. I only need to aggregate by ticketID, which yields nearly 1 million records, but I must also be able to subtract the values for a specified source_name, so I have to track each source's contribution.

Are there any algorithms or data structures that solve this problem?

Recommended answer

I can't quite parse the question fully, so I'll assume:


  • "nearly 1 million of record" means that there are nearly 1 million unique ticketID values.
  • There are nearly 100 different source_names in the system.
  • Not every ticketId has every source_name, so we don't have 100 million ticketID x source_name combinations.
  • You want to be able to total across all of the ticketIds but also total by source_name.

With these assumptions I would use a Map of Maps. The outer Map is keyed by source_name, and its value is the inner Map. The inner Map is keyed by ticketId, and its value is the cumulative value for that ticketId from that source.

So the pseudocode looks like this:

import java.util.HashMap;
import java.util.Map;

// source_name -> (ticketId -> cumulative value)
Map<String, Map<Integer, Double>> valueMap = new HashMap<>();

while (...reading in and processing data...) {
    int ticketId = ...;
    String sourceName = ...;
    double entryValue = ...;

    // Create the inner map the first time this source_name is seen.
    Map<Integer, Double> sourceNameMap =
        valueMap.computeIfAbsent(sourceName, k -> new HashMap<>());
    // Start at entryValue, or add it to the existing sum for this ticketId.
    sourceNameMap.merge(ticketId, entryValue, Double::sum);
}

You can easily get the total for a ticketId by adding up its entries across the source_name maps, and you can leave a given source_name out by skipping (or subtracting) that source's entry. You can also keep a running total for each source_name, of course, if that helps. If your system can allocate a gigabyte to the JVM, it should be able to handle a good number of ticketID x source_name pairs.
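As a rough illustration of the totals described above, here is a minimal sketch that builds the nested map and then computes a ticket's total with and without one source. The class name ContributionTracker and its methods add, total, and totalExcluding are hypothetical names chosen for this example, not part of the original answer, and the sample records are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class ContributionTracker {
    // source_name -> (ticketId -> cumulative value)
    private final Map<String, Map<Integer, Double>> valueMap = new HashMap<>();

    /** Adds one record's value to the running sum for its (source_name, ticketId) pair. */
    public void add(int ticketId, String sourceName, double value) {
        valueMap.computeIfAbsent(sourceName, k -> new HashMap<>())
                .merge(ticketId, value, Double::sum);
    }

    /** Total for a ticket across all sources. */
    public double total(int ticketId) {
        double sum = 0.0;
        for (Map<Integer, Double> perSource : valueMap.values()) {
            sum += perSource.getOrDefault(ticketId, 0.0);
        }
        return sum;
    }

    /** Total for a ticket with one source's contribution subtracted. */
    public double totalExcluding(int ticketId, String sourceName) {
        Map<Integer, Double> perSource = valueMap.get(sourceName);
        double contribution =
            perSource == null ? 0.0 : perSource.getOrDefault(ticketId, 0.0);
        return total(ticketId) - contribution;
    }

    public static void main(String[] args) {
        ContributionTracker tracker = new ContributionTracker();
        tracker.add(1, "A", 10.0);
        tracker.add(1, "B", 5.0);
        tracker.add(1, "A", 2.5);
        tracker.add(2, "B", 4.0);
        System.out.println(tracker.total(1));               // prints 17.5
        System.out.println(tracker.totalExcluding(1, "B")); // prints 12.5
    }
}
```

Note that total walks every source map, so each lookup costs one hash probe per source_name (about 100 here). If totals are read far more often than they are updated, keeping a separate ticketId-to-total map alongside the nested one may be worth the extra memory.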

You might consider creating a mutable internal value class to save on GC cycles:

private static class MutableValue {
    double value;
    public MutableValue(double value) {
        this.value = value;
    }
    public void add(double value) {
        this.value += value;
    }
}

Then you can write:

// sourceNameMap is now a Map<Integer, MutableValue>
MutableValue value = sourceNameMap.get(ticketId);
if (value == null) {
    sourceNameMap.put(ticketId, new MutableValue(entryValue));
} else {
    value.add(entryValue);
}
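For completeness, here is a self-contained sketch of the mutable-value variant. The wrapper class MutableValueDemo and the helper accumulate are hypothetical names added for this example; the MutableValue holder mirrors the class shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, for illustration only.
class MutableValueDemo {
    /** Mutable holder, so an update does not allocate a new Double each time. */
    static final class MutableValue {
        double value;

        MutableValue(double value) {
            this.value = value;
        }

        void add(double delta) {
            this.value += delta;
        }
    }

    /** Adds entryValue to the holder for ticketId, creating the holder on first sight. */
    static void accumulate(Map<Integer, MutableValue> sourceNameMap,
                           int ticketId, double entryValue) {
        MutableValue value = sourceNameMap.get(ticketId);
        if (value == null) {
            sourceNameMap.put(ticketId, new MutableValue(entryValue));
        } else {
            value.add(entryValue); // updates in place, no second put needed
        }
    }

    public static void main(String[] args) {
        Map<Integer, MutableValue> sourceNameMap = new HashMap<>();
        accumulate(sourceNameMap, 7, 1.5);
        accumulate(sourceNameMap, 7, 2.5);
        System.out.println(sourceNameMap.get(7).value); // prints 4.0
    }
}
```

The Integer keys are still boxed in this version; only the values avoid reallocation on update.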

If you edit your question, I'll edit my answer in case I've made some improper assumptions.
