Generating a histogram for highly skewed data


Question

I'm using dc.js, crossfilter.js and d3.js to generate a bar chart.

The bar chart represents credit card transaction data. It plots the number of transactions (y-axis) against the transaction dollar amount (x-axis).

It looks like this (the chart image is not preserved here).

The data array is basically as follows:

[
  ...
  {
    txn_id: 1,
    txn_amount: 20
  },
  ...
]



The data is highly variable depending on the merchant and other factors, and I can't make any assumptions about its distribution.

As you can see, this graph isn't all that useful because of the data itself. In this case there is 1 transaction for -$7500 and 2 at around $7500.

In between there are other amounts, but most transactions cluster around $0 - $100, where you can see the spike.

Unfortunately, there is enough variance that you can't even see the bars for the less frequent transaction amounts.

This answer seems close, but it isn't quite there.

What I'd really like to do is break the x-axis ticks into 10 reasonably sized chunks that group the transaction amounts sensibly, to make the graph more useful.

For example, let's say that in this case the average transaction amount is $20, and the extreme min and max values are -$7500 and $7500.

So in this particular example I might like to have the x-axis chunked up like so:

Bin 1: -$1000 >= transaction amount
Bin 2: -$100 >= transaction amount > -$1000
Bin 3: -$50 >= transaction amount > -$100
Bin 4: $0 >= transaction amount > -$50
Bin 5: $15 >= transaction amount > $0
Bin 6: $25 >= transaction amount > $15
Bin 7: $40 >= transaction amount > $25
Bin 8: $100 >= transaction amount > $40
Bin 9: $1000 >= transaction amount > $100
Bin 10: transaction amount > $1000

(The chunk/bin size gets smaller and smaller the closer we get to the average.)
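The grouping described by that bin list can be sketched in plain JavaScript. This is only an illustration under stated assumptions: `upperBounds`, `binFor` and `countByBin` are hypothetical names, and the bounds are simply the hand-picked values from the list above rather than anything derived from the data. Each bin includes its upper bound and excludes its lower bound, matching the inequalities as written.

```javascript
// Hypothetical upper bounds copied from the bin list above.
// Bin 10 has no upper bound, so nine bounds define ten bins.
const upperBounds = [-1000, -100, -50, 0, 15, 25, 40, 100, 1000];

// Returns the 1-based bin number for an amount: the first bin whose
// upper bound is >= amount, or bounds.length + 1 if none is.
function binFor(amount, bounds) {
  let lo = 0;
  let hi = bounds.length;
  // Binary search for the leftmost bound that is >= amount.
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (bounds[mid] < amount) lo = mid + 1;
    else hi = mid;
  }
  return lo + 1;
}

// Counts transactions per bin; index 0 holds bin 1, index 9 holds bin 10.
function countByBin(amounts, bounds) {
  const counts = new Array(bounds.length + 1).fill(0);
  for (const a of amounts) counts[binFor(a, bounds) - 1] += 1;
  return counts;
}
```

With d3 available, `d3.bisectLeft(upperBounds, amount) + 1` should give the same bin number as `binFor`, since it performs the same leftmost search on a sorted array.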

Admittedly, it's been ages since I've done any serious study of statistics, so I'm quite rusty. But it does seem that the way I break my data up into bins/chunks will have a lot to do with the standard deviation of my data.

I guess I have a good feel for what I want; I'm just a bit lost on how to use d3.js (d3.mean(), d3.quantile()?) and dc.js to get a histogram like the one I've described.

So what's the correct way, or what libraries should I be using, to:


  1. Create 10 reasonably sized bins from any given dataset

  2. Group the data into those bins (this part should actually be quite simple)
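One possible reading of these two steps is to take the bin edges from the data's own deciles. The sketch below is an assumption-laden illustration rather than a dc.js recipe: `quantileSorted`, `quantileThresholds` and `binIndex` are hypothetical helpers written in plain JavaScript, and linear interpolation between neighbouring sorted values is just one of several quantile conventions.

```javascript
// Linearly interpolated quantile of an ascending-sorted, non-empty array.
// p is a fraction in [0, 1].
function quantileSorted(sorted, p) {
  const h = (sorted.length - 1) * p;
  const i = Math.floor(h);
  const next = Math.min(i + 1, sorted.length - 1);
  return sorted[i] + (sorted[next] - sorted[i]) * (h - i);
}

// Step 1: returns binCount - 1 thresholds that split the values into
// binCount groups of roughly equal size (deciles when binCount is 10).
function quantileThresholds(values, binCount) {
  const sorted = values.slice().sort((a, b) => a - b);
  const thresholds = [];
  for (let k = 1; k < binCount; k++) {
    thresholds.push(quantileSorted(sorted, k / binCount));
  }
  return thresholds;
}

// Step 2: 0-based bin index of a value, counting how many thresholds
// lie strictly below it (so each bin includes its upper threshold).
function binIndex(value, thresholds) {
  let index = 0;
  while (index < thresholds.length && thresholds[index] < value) index += 1;
  return index;
}
```

Because the thresholds follow the data, the bins are narrow where transactions are dense and wide in the tails. One caveat: when many transactions share the same amount, neighbouring thresholds can coincide and leave some bins empty.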

In terms of the physical spacing of the histogram's x-axis, I don't think it's necessary or desirable for the ticks to be unevenly spaced (so perhaps it is no longer a histogram).

I'd prefer that the ticks stay evenly spaced even though the chunk sizes are not equal. I will just be sure to label the ticks appropriately.

Any pointers in the right direction would be much appreciated.

Update:

So it seems d3.js is several steps ahead of me, as usual, and already has my back. I believe I can use d3.scale.quantile() to break the x-axis up into 10 quantiles (deciles). Indeed, I've set up my quantile scale and it seems to be doing the right thing: when I pass numbers directly into the quantile scale function (via the JS console), it outputs the correct bucket (out of the 10).

But unfortunately my graph is still messed up. Here is my code:

var datum = crossfilter(data),
    amount = datum.dimension(function(d) { return +d.txn_amount; }),
    amounts = amount.group();

amountsChart = dc.barChart("#dc-amounts-chart");
amountsChart
  .width(defaultWidth)
  .height(defaultHeight)
  .margins({top: 20, right: 20, bottom: 20, left: 50})
  .dimension(amount)
  .group(amounts)
  .centerBar(true)
  .gap(5)
  .elasticY(true)
  .x(d3.scale.quantile().domain(amounts.all().map(function(d) {
                          // d.key is the transaction dollar amount,
                          // d.value is the number of transactions at that amount
                          return d.key;
                        }))
                        .range([0,1,2,3,4,5,6,7,8,9]));

amountsChart.yAxis().ticks(5);

dc.renderAll();

And the resulting chart:

[Image: Quantiled Bar Chart, https://i.stack.imgur.com/R1EG9.png]

I think I received…

Recommended answer

You could use an outlier test to trim out your, well, outliers and then add them back into the extreme bins. I'd also change the text on those bins to y, but that can easily be done by passing a custom set of ticks to the axis.

I've mocked up an example using Chauvenet's criterion, one of a number of outlier tests. I'd originally thought to use the Grubbs test (or, even better, the multiple Grubbs-Beck test), but there's a bit of work involved in coding that. As used here, Chauvenet's criterion works quite simply: any value more than m standard deviations from your mean is treated as an outlier. (Strictly speaking, the criterion derives that cutoff from the sample size; this example uses a fixed cutoff.)

I've put this all together here, and the function is:

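// Trim outliers: keep only the values that lie within dMax standard
// deviations of the mean.
function chauvenet(x) {
    var dMax = 3;                                // cutoff, in standard deviations
    var mean = d3.mean(x);
    var stdv = Math.sqrt(variance(x, mean));

    // With zero spread there are no outliers; this also avoids dividing by zero.
    if (!(stdv > 0)) return x.slice();

    return x.filter(function (v) {
        return dMax > Math.abs(v - mean) / stdv;
    });
}

// Assumed helper: the original calls variance(x) without showing it.
// This version computes the sample variance (n - 1 denominator).
function variance(x, mean) {
    var sumSq = d3.sum(x, function (v) { return (v - mean) * (v - mean); });
    return x.length > 1 ? sumSq / (x.length - 1) : 0;
}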

The terms are all fairly obvious: dMax is the number of standard deviations, mean is the mean and stdv is the standard deviation (the square root of the variance).

Note that I've not added the outliers back into the histogram, but that should be quite easy to do.
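One way to put the outliers back, sketched here as an assumption rather than as the answer's own code, is to clamp them to the cutoff instead of dropping them, so that they fall into the first and last bins. `clampOutliers` is a hypothetical helper in plain JavaScript (no d3), and it uses the sample standard deviation, which is a choice the answer does not specify.

```javascript
// Clamps every value to [mean - dMax * stdv, mean + dMax * stdv] so that
// outliers are kept but pulled into the extreme bins.
function clampOutliers(values, dMax) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  const sumSq = values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0);
  // Sample standard deviation (assumption); zero when there are fewer than 2 values.
  const stdv = n > 1 ? Math.sqrt(sumSq / (n - 1)) : 0;
  const low = mean - dMax * stdv;
  const high = mean + dMax * stdv;
  return {
    low: low,    // lower cutoff, useful for labelling the first bin
    high: high,  // upper cutoff, useful for labelling the last bin
    values: values.map(v => Math.min(high, Math.max(low, v)))
  };
}
```

The clamped values no longer carry their true magnitudes, so the two extreme bins would need open-ended tick labels to make clear that they also hold everything beyond the cutoffs.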
