crossfilter "double grouping" where key is the value of another reduction

Question

Here is my data about MAC addresses. It is recorded once per minute, and each minute contains many unique MAC addresses.

mac_add,created_time
18:59:36:12:23:33,2016-12-07 00:00:00.000
1c:e1:92:34:d7:46,2016-12-07 00:00:00.000
2c:f0:ee:86:bd:51,2016-12-07 00:00:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:00:00.000
...
18:59:36:12:23:33,2016-12-07 00:01:00.000
1c:cd:e5:1e:99:78,2016-12-07 00:01:00.000
1c:e1:92:34:d7:46,2016-12-07 00:01:00.000
5c:cf:7f:22:01:df,2016-12-07 00:01:00.000
5c:cf:7f:d3:2e:ce,2016-12-07 00:01:00.000
...

I would like to create 2 bar charts using dc.js and crossfilter. Please refer to the image for the charts.

The first bar chart is easy enough to create. It is brushable. I created the "created_time" dimension, and created a group and reduceCount by "mac_add", such as below:

var moveTime = ndx.dimension(function (d) {
                    return d.dd; //# this is the created_time
                });
var timeGroup = moveTime.group().reduceCount(function (d) {
                    return d.mac_add;
                });
var visitorChart = dc.barChart('#visitor-no-bar');
visitorChart.width(990) 
                .height(350)
                .margins({ top: 0, right: 50, bottom: 20, left: 40 })
                .dimension(moveTime)
                .group(timeGroup)
                .centerBar(true)
                .gap(1)
                .elasticY(true)
                .x(d3.time.scale().domain([new Date(2016, 11, 7), new Date(2016, 11, 13)]))
                .round(d3.time.minute.round)
                .xUnits(d3.time.minute);

visitorChart.render();

The problem is the second bar chart. One row of data equals one minute, so I can count the rows for each MAC address to get its total time length: create another dimension on "mac_add" and call reduceCount on its group. The goal is then to group those time lengths into 30-minute bins, so that the chart shows how many MAC addresses have a time length of 30 minutes or less, how many fall between 30 minutes and 1 hour, how many between 1 hour and 1.5 hours, and so on.

Please correct me if I am wrong. Logically, I think the dimension of the second bar chart should be the time-length group (<30 min, <1 hr, <1.5 hr, etc.). But the time-length groups themselves are not fixed: they depend on the brush selection in the first chart. The selection might contain only the 30-minute bin, only the 1.5-hour bin, both the 1.5-hour and 2-hour bins, and so on.

So I am really confused about which parameters to pass to the second bar chart, and how to obtain them (that is, how to group data that is already grouped). Please help me by explaining the solution.

Regards, Marvin

Recommended answer

I think we've called this a "double grouping" in the past, but I can't find the previous questions.

I'd start with a regular crossfilter group for the mac addresses, and then produce a fake group to aggregate by count of minutes.

var minutesPerMacDim = ndx.dimension(function(d) { return d.mac_add; }),
    minutesPerMacGroup = minutesPerMacDim.group(); // counts rows (minutes) per mac address

function bin_keys_by_value(group, bin_value) {
    var _bins;
    return {
        all: function() {
            var bins = {};
            group.all().forEach(function(kv) {
                var valk = bin_value(kv.value);
                bins[valk] = bins[valk] || [];
                bins[valk].push(kv.key);
            });
            _bins = bins;
            // note: Object.keys returning numerical order here might not
            // work everywhere, but I couldn't find a browser where it didn't
            return Object.keys(bins).map(function(bin) {
                return {key: bin, value: bins[bin].length};
            })
        },
        bins: function() {
            return _bins;
        }
    };
}

function bin_30_mins(v) {
    // round up to the nearest multiple of 30 minutes
    return 30 * Math.ceil(v/30);
}

var macsPerMinuteCount = bin_keys_by_value(minutesPerMacGroup, bin_30_mins);

This will retain the mac addresses for each time bin, which we'll need for filtering later. It's uncommon to add a non-standard method bins to a fake group, but I can't think of an efficient way to retain that information, given that the filtering interface will only give us access to the keys.
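
To make the shape of the output concrete, here is a minimal sketch that runs the fake group against a stand-in for the crossfilter group. The stub group, the shortened MAC addresses and the minute counts are made up for illustration; only the fake-group logic follows the answer.

```javascript
// Stand-in for a crossfilter group: all() returns [{key, value}, ...].
// The MAC addresses and minute counts are invented for this example.
var stubGroup = {
    all: function() {
        return [
            {key: 'aa:aa', value: 12},
            {key: 'bb:bb', value: 30},
            {key: 'cc:cc', value: 45},
            {key: 'dd:dd', value: 61}
        ];
    }
};

// Fake group: bins the keys of `group` by bin_value(value).
function bin_keys_by_value(group, bin_value) {
    var _bins;
    return {
        all: function() {
            var bins = {};
            group.all().forEach(function(kv) {
                var valk = bin_value(kv.value);
                bins[valk] = bins[valk] || [];
                bins[valk].push(kv.key);
            });
            _bins = bins; // cache for later filtering
            return Object.keys(bins).map(function(bin) {
                return {key: bin, value: bins[bin].length};
            });
        },
        bins: function() { return _bins; }
    };
}

// Round up to the nearest multiple of 30 minutes.
function bin_30_mins(v) {
    return 30 * Math.ceil(v / 30);
}

var fakeGroup = bin_keys_by_value(stubGroup, bin_30_mins);
var rows = fakeGroup.all();    // one {key, value} row per 30-minute bin
var cached = fakeGroup.bins(); // bin -> list of MAC addresses in that bin
console.log(rows, cached);
```

Note that the keys come back from Object.keys as strings ('30', '60', '90'), and that bins() only has data after all() has been called at least once.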

Since the function takes a binning function, we could even use a threshold scale if we wanted more complicated bins than just rounding up to the nearest 30 minutes. A quantize scale is a more general way to do the rounding shown above.
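
As a rough illustration of uneven bins, the sketch below builds a threshold-style binning function in plain JavaScript. It is a stand-in for a d3 threshold scale, not the d3 API itself; make_threshold_bins and the cut points are hypothetical names and values chosen for the example.

```javascript
// Hypothetical helper: builds a binning function from ascending cut points.
// Each value maps to the first cut point that is >= the value; values above
// the last cut point fall into a catch-all Infinity bin.
function make_threshold_bins(cuts) {
    return function(v) {
        for (var i = 0; i < cuts.length; i++) {
            if (v <= cuts[i]) return cuts[i];
        }
        return Infinity;
    };
}

// Uneven bins: up to 30 min, up to 1 h, up to 3 h, up to 12 h, longer.
var bin_uneven = make_threshold_bins([30, 60, 180, 720]);
var binned = [12, 30, 45, 200, 1000].map(bin_uneven);
console.log(binned);
```

A function like bin_uneven could be passed as the bin_value argument in place of bin_30_mins. With uneven bins like these, a linear x scale would probably no longer be a good fit, so the chart's scale would need rethinking.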

Using this data to drive a chart is simple: we can use the dimension and fake group as usual.

chart
    .dimension(minutesPerMacDim)
    .group(macsPerMinuteCount)

Setting up the chart so that it can filter is a bit more complicated:

chart.filterHandler(function(dimension, filters) {
    if(filters.length === 0)
        dimension.filter(null);
    else {
        var bins = chart.group().bins(); // retrieve cached bins
        var macs = filters.map(function(key) { return bins[key]; })
        macs = Array.prototype.concat.apply([], macs);
        var macset = d3.set(macs);
        dimension.filterFunction(function(key) {
            return macset.has(key);
        })
    }
})

Recall that we're using a dimension which is keyed on mac addresses; this is good because we want to filter on mac addresses. But the chart is receiving minute-counts for its keys, and the filters will contain those keys, like 30, 60, 90, etc. So we need to supply a filterHandler which takes minute-count keys and filters the dimension based on those.
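
The key-to-address translation can be tried in isolation. This is a minimal sketch, assuming the filters array holds bin keys and the cached bins have the shape returned by bins(); macs_for_filters and the sample data are hypothetical, and it uses a native Set where the handler above uses d3.set.

```javascript
// Hypothetical cached bins, in the shape returned by the fake group's bins().
var bins = {
    30: ['aa:aa', 'bb:bb'],
    60: ['cc:cc'],
    90: ['dd:dd']
};

// Collect every MAC address that belongs to one of the selected bins.
function macs_for_filters(bins, filters) {
    var macs = new Set();
    filters.forEach(function(key) {
        (bins[key] || []).forEach(function(mac) { macs.add(mac); });
    });
    return macs;
}

var macset = macs_for_filters(bins, [30, 90]);

// A predicate like this is what would be handed to dimension.filterFunction.
var keep = function(mac) { return macset.has(mac); };
var kept = ['aa:aa', 'cc:cc', 'dd:dd'].filter(keep);
console.log(kept);
```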

Note 1: This is all untested, so if it doesn't work, please post an example as a fiddle or bl.ock - there are fiddles and blocks you can fork to get started on the main page.

Note 2: Strictly speaking, this is not measuring the length of connections: it's counting the total number of minutes connected. Not sure if this matters to you. If a user disconnects and then reconnects within the timeframe, the two sessions will be counted as one. I think you'd have to preprocess to get duration.
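
If session durations do matter, the preprocessing might look something like the sketch below. It is only one possible approach: session_lengths is a hypothetical helper, and it assumes the timestamps for a single MAC address are already sorted and expressed in milliseconds, with any gap longer than maxGapMs starting a new session.

```javascript
// Hypothetical helper: split one MAC address's sorted timestamps (ms) into
// sessions and return each session's length in minutes. Every row counts as
// one minute, so a session of n consecutive rows has length n.
function session_lengths(times, maxGapMs) {
    var minute = 60 * 1000;
    var sessions = [];
    var start = null, prev = null;
    times.forEach(function(t) {
        if (prev === null || t - prev > maxGapMs) {
            // close the previous session, if any, and start a new one
            if (prev !== null) sessions.push((prev - start) / minute + 1);
            start = t;
        }
        prev = t;
    });
    if (prev !== null) sessions.push((prev - start) / minute + 1);
    return sessions;
}

var m = 60 * 1000;
// Seen at minutes 0, 1 and 2, then again at minutes 10 and 11.
var seen = [0, 1 * m, 2 * m, 10 * m, 11 * m];
var lengths = session_lengths(seen, m);
console.log(lengths);
```

The resulting per-session lengths could then be loaded into crossfilter as their own rows, instead of counting minutes per MAC address.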

EDIT: Based on your fiddle (thank you!) the code above does seem to work. It's just a matter of setting up the x scale and xUnits properly.

  chart2
      .x(d3.scale.linear().domain([60,1440]))
      .xUnits(function(start, end) {
          return (end-start)/30;
      })

A linear scale will do just fine here - I wouldn't try to quantize that scale, since the 30-minute divisions are already set up. We do need to set the xUnits so that dc.js knows how wide to make the bars.

I'm not sure why elasticX didn't work here, but the <30 bin completely dwarfed everything else, so I thought it was best to leave that out.

Fork of your fiddle: https://jsfiddle.net/gordonwoodhull/2a8ow1ay/2/
