Avoid multiple sums in custom crossfilter reduce functions


This question arises from some difficulties in creating a crossfilter dataset, in particular how to group the different dimensions and compute derived values. The final aim is to build a number of dc.js graphs from the dimensions and groups.

(Fiddle example https://jsfiddle.net/raino01r/0vjtqsjL/)

Question

Before going on with the explanation of the setting, the key question is the following:

How can I create custom add, remove, and init functions to pass to .reduce so that the first two do not sum the same feature multiple times?

Data

Let's say I want to monitor the failure rate of a number of machines (just an example). I do this using different dimensions: month, machine location, and type of failure.

For example, I have data in the following form:

| month   | room | failureType | failCount | machineCount |
|---------|------|-------------|-----------|--------------|
| 2015-01 |  1   |  A          |  10       |  5           |
| 2015-01 |  1   |  B          |   2       |  5           |
| 2015-01 |  2   |  A          |   0       |  3           |
| 2015-01 |  2   |  B          |   1       |  3           |
| 2015-02 |  .   |  .          |   .       |  .           |

Expected

For the three given dimensions, I should have:

  • month_1_rate = $\frac{10+2+0+1}{5+3}$;
  • room_1_rate = $\frac{10+2}{5}$;
  • type_A_rate = $\frac{10+0}{5+3}$.
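As a cross-check on the three expected values, the arithmetic can be reproduced in plain JavaScript without crossfilter. This is only a sketch: the rows are copied from the table above, `expectedRate` is a hypothetical helper name, and month values are assumed to be plain strings.

```javascript
// Sample rows copied from the table above (only the 2015-01 rows are given).
const rows = [
  { month: '2015-01', room: 1, failureType: 'A', failCount: 10, machineCount: 5 },
  { month: '2015-01', room: 1, failureType: 'B', failCount: 2, machineCount: 5 },
  { month: '2015-01', room: 2, failureType: 'A', failCount: 0, machineCount: 3 },
  { month: '2015-01', room: 2, failureType: 'B', failCount: 1, machineCount: 3 },
];

// Hypothetical helper: sum failCount over every row, but add machineCount
// only once per distinct (month, room) pair.
function expectedRate(subset) {
  const seen = new Set();
  let failures = 0;
  let machines = 0;
  for (const r of subset) {
    failures += r.failCount;
    const key = r.month + '_' + r.room;
    if (!seen.has(key)) {
      seen.add(key);
      machines += r.machineCount;
    }
  }
  return machines === 0 ? 0 : failures / machines;
}

const monthRate = expectedRate(rows.filter(r => r.month === '2015-01')); // (10+2+0+1)/(5+3)
const roomRate = expectedRate(rows.filter(r => r.room === 1));           // (10+2)/5
const typeRate = expectedRate(rows.filter(r => r.failureType === 'A'));  // (10+0)/(5+3)
console.log(monthRate, roomRate, typeRate);
```

A naive sum of machineCount over the same rows would give 16 machines for the month instead of 8, which is the double counting the question is about.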

Idea

Essentially, what counts in this setting is the pair (month, room). That is, given a month and a room there should be a rate attached to them (crossfilter should then take the other filters into account).

Therefore, one way to go could be to store the pairs that have already been used and not sum machineCount for them again. However, we still want to update the failCount value.

Attempt (failing)

My attempt was to create custom reduce functions that do not sum machineCount values that have already been taken into account.

However, there is some unexpected behaviour. I'm sure this is not the way to go, so I would appreciate suggestions.

// A dimension is one of:
// ndx = crossfilter(data);
// ndx.dimension(function(d){return d.month;})
// ndx.dimension(function(d){return d.room;})
// ndx.dimension(function(d){return d.failureType;})
// Goal: have a general way to get the group given the dimension:

function get_group(dim){
    return dim.group().reduce(add_rate, remove_rate, initial_rate);
}

// month is given as datetime object
var monthNameFormat = d3.time.format("%Y-%m");
//
function check_done(p, v){
    return p.done.indexOf(v.room+'_'+monthNameFormat(v.month))==-1;
}    

// The three functions needed for the custom `.reduce` block.
function add_rate(p, v){
    var index = check_done(p, v);
    if (index) p.done.push(v.room+'_'+monthNameFormat(v.month));
    var count_to_sum = (index)? v.machineCount:0;
    p.mach_count += count_to_sum;
    p.fail_count += v.failCount;
    p.rate = (p.mach_count==0) ? 0 : p.fail_count*1000/p.mach_count;
    return p;
}
function remove_rate(p, v){
    var index = check_done(p, v);
    var count_to_subtract = (index)? v.machineCount:0;
    if (index) p.done.push(v.room+'_'+monthNameFormat(v.month));
    p.mach_count -= count_to_subtract;
    p.fail_count -= v.failCount;
    p.rate = (p.mach_count==0) ? 0 : p.fail_count*1000/p.mach_count;
    return p;
}
function initial_rate(){
    return {rate: 0, mach_count:0, fail_count:0, done: new Array()};
}

Connection with dc.js

As mentioned, the previous code is needed to create the dimensions and groups that are passed to three different bar graphs using dc.js.

Each graph will have .valueAccessor(function(d){return d.value.rate;}).

See the jsfiddle (https://jsfiddle.net/raino01r/0vjtqsjL/) for an implementation. The numbers are different, but the data structure is the same. Notice that in the fiddle you would expect a machine count of 18 (in both months), but you always get double that (because of the 2 different locations).


Edit

Reductio + dc.js

Following Ethan Jewett's answer, I used Reductio to take care of the grouping. The updated fiddle is here: https://jsfiddle.net/raino01r/dpa3vv69/

My reducer object needs two exceptions (month, room) when summing the machineCount values. Hence it is built as follows:

var reducer = reductio()
reducer.value('mach_count')
       .exception(function(d) { return d.room; })
       .exception(function(d) { return d.month; })
       .exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
       .sum(function(d) { return d.failCount; })

This seems to fix the numbers when the graphs are rendered.

However, I see strange behaviour when filtering on a single month and looking at the numbers in the type graph.

Possible solution

Rather than creating two exceptions, I could merge the two fields when processing the data. That is, as soon as the data is defined I could do:

data.forEach(function(x){
    x['room_month'] = x['room'] + '_' + x['month'];
})

Then the above reduction code should become:

var reducer = reductio()
reducer.value('mach_count')
       .exception(function(d) { return d.room_month; })
       .exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
       .sum(function(d) { return d.failCount; })

This solution seems to work. However, I am not sure whether this is a sensible thing to do: if the dataset is large, adding a new field could slow things down quite a lot!
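The difference the key choice makes can be illustrated without any library. The sketch below does not use Reductio or reproduce its internals; it only counts machineCount once per distinct key. The 2015-02 values are made up for illustration and `sumOncePerKey` is a hypothetical helper.

```javascript
// Hypothetical rows for two months; the 2015-02 machine counts are invented.
const records = [
  { month: '2015-01', room: 1, machineCount: 5 },
  { month: '2015-01', room: 2, machineCount: 3 },
  { month: '2015-02', room: 1, machineCount: 5 },
  { month: '2015-02', room: 2, machineCount: 3 },
];

// Sum machineCount once per distinct key produced by keyFn.
function sumOncePerKey(rows, keyFn) {
  const seen = new Set();
  let total = 0;
  for (const r of rows) {
    const key = keyFn(r);
    if (!seen.has(key)) {
      seen.add(key);
      total += r.machineCount;
    }
  }
  return total;
}

// Keyed on room alone, room 1 in 2015-02 collides with room 1 in 2015-01.
const byRoomOnly = sumOncePerKey(records, r => r.room);
// Keyed on the merged field, each (room, month) pair is counted once.
const byRoomMonth = sumOncePerKey(records, r => r.room + '_' + r.month);
console.log(byRoomOnly, byRoomMonth);
```

Building the merged field is a single pass over the data, done once before the records are handed to crossfilter.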

Solution

A few things:

  1. Don't calculate rates in your Crossfilter reducers. Calculate the components of the rates instead. This keeps the reducers both simpler and faster. Do the actual division in your value accessor.

  2. You've basically got the right idea. I think there are two problems that I see immediately:

    • In your remove_rate you are not removing the key from the p.done array. You should be doing something like if (index) p.done.splice(p.done.indexOf(v.room+'_'+monthNameFormat(v.month)), 1); to remove it.

    • In your reduce functions, index is a boolean. (index == -1) will never evaluate to true, IIRC. So your added machine count will always be 0. Use var count_to_sum = index ? v.machineCount:0; instead.

If you want to put together a working example, I or someone else will be happy to get it going for you, I'm sure.
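One possible shape for reducers that follow both points is sketched below. It is not the asker's code or the fiddle: it swaps the p.done array for a per-key record count, so that machineCount is only subtracted when the last record for a (room, month) pair leaves the group. The names `keyOf`, `addComponents`, `removeComponents`, `initialComponents` and `rateAccessor` are hypothetical, and month is assumed to be a string rather than a Date.

```javascript
// Assumption: month is already a string such as '2015-01'.
function keyOf(v) {
  return v.room + '_' + v.month;
}

// Reducers store only the components of the rate, not the rate itself.
function addComponents(p, v) {
  const key = keyOf(v);
  p.seen[key] = (p.seen[key] || 0) + 1;
  // Count the machines only for the first record of this (room, month) pair.
  if (p.seen[key] === 1) p.mach_count += v.machineCount;
  p.fail_count += v.failCount;
  return p;
}

function removeComponents(p, v) {
  const key = keyOf(v);
  p.seen[key] -= 1;
  // Subtract the machines only when the last record of the pair is removed.
  if (p.seen[key] === 0) {
    delete p.seen[key];
    p.mach_count -= v.machineCount;
  }
  p.fail_count -= v.failCount;
  return p;
}

function initialComponents() {
  return { mach_count: 0, fail_count: 0, seen: {} };
}

// The division happens in the value accessor (same *1000 scaling as the question).
function rateAccessor(d) {
  return d.value.mach_count === 0 ? 0 : d.value.fail_count * 1000 / d.value.mach_count;
}

// Simulate what a group would do when the four sample records are added.
const sample = [
  { month: '2015-01', room: 1, failCount: 10, machineCount: 5 },
  { month: '2015-01', room: 1, failCount: 2, machineCount: 5 },
  { month: '2015-01', room: 2, failCount: 0, machineCount: 3 },
  { month: '2015-01', room: 2, failCount: 1, machineCount: 3 },
];
const state = sample.reduce(addComponents, initialComponents());
console.log(state.mach_count, state.fail_count, rateAccessor({ key: 'all', value: state }));
```

With crossfilter these would be passed as dim.group().reduce(addComponents, removeComponents, initialComponents), and rateAccessor would go to .valueAccessor on each chart.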

You may also want to try Reductio. Crossfilter reducers are difficult to do right and efficiently, so it may make sense to use a library to help. With Reductio, creating a group that calculates your machine count and failure count looks like this:

var reducer = reductio()
reducer.value('mach_count')
  .exception(function(d) { return d.room; })
  .exceptionSum(function(d) { return d.machineCount; })
reducer.value('fail_count')
  .sum(function(d) { return d.failCount; })

var dim = ndx.dimension(...)
var grp = dim.group()
reducer(grp)
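With the components stored on the group, the division can then live in the value accessor. The sketch below assumes that Reductio exposes the results as d.value.mach_count.exceptionSum and d.value.fail_count.sum; that shape should be checked against the Reductio documentation. `rateAccessor` and `mockEntry` are hypothetical names, and the mock stands in for a real group entry so the snippet runs without the library.

```javascript
// Assumed shape of a group entry produced by the reducer above:
// { key: ..., value: { mach_count: { exceptionSum: n }, fail_count: { sum: m } } }
function rateAccessor(d) {
  const machines = d.value.mach_count.exceptionSum;
  const failures = d.value.fail_count.sum;
  // Same *1000 scaling as in the question; guard against division by zero.
  return machines ? failures * 1000 / machines : 0;
}

// Mock entry for failure type A, using the sample numbers from the question.
const mockEntry = {
  key: 'A',
  value: { mach_count: { exceptionSum: 8 }, fail_count: { sum: 10 } },
};
const rate = rateAccessor(mockEntry);
console.log(rate);
```

In a dc.js chart this function would be passed to .valueAccessor(rateAccessor).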
