sample_weight option in the ELKI implementation of DBSCAN


Question

My goal is to find outliers in a dataset that contains many near-duplicate points, and I want to use the ELKI implementation of DBSCAN for this task.

As I don't care about the clusters themselves, just the outliers (which I assume lie relatively far from the clusters), I want to speed up the runtime by aggregating/binning points on a grid and using the concept implemented in scikit-learn as sample_weight.
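The aggregation/binning step described above can be sketched with NumPy. This is an illustrative sketch only: `bin_points` and `cell_size` are hypothetical names invented here, not part of scikit-learn or ELKI, and the sample points are made up.

```python
import numpy as np

def bin_points(points, cell_size):
    # Map each 2-D point to a grid cell, then keep one representative
    # per occupied cell together with how many originals fell into it.
    cells = np.floor(points / cell_size).astype(np.int64)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    # Use the cell centre as the representative ("binned") point.
    centers = (uniq + 0.5) * cell_size
    return centers, counts

# Four points collapsing into two cells: three near-duplicates and one stray.
pts = np.array([[0.1, 0.1], [0.2, 0.2], [0.15, 0.1], [5.0, 5.0]])
centers, weights = bin_points(pts, cell_size=1.0)
```

The `counts` array then plays the role of the per-point weights passed on to the clustering step.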

Can you please show minimal code to do a similar analysis in ELKI?

Let's assume my dataset contains two columns of features (the coordinates of the aggregated/binned points on the x-y grid) and a third column of sample weights, sample_weight_feature (the number of original dataset points in the neighbourhood of each aggregated/binned point). In scikit-learn, the answer I expect would be to call the function fit in the following way: fit(self, features, y=None, sample_weight=sample_weight_feature)
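For reference, the scikit-learn call described above looks like this as runnable code (the data values here are invented purely for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# features: coordinates of the binned points; sample_weight_feature:
# how many original points each binned point represents.
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0]])
sample_weight_feature = np.array([50, 40, 60, 1])

# A point is core when the weighted neighbour count reaches min_samples.
db = DBSCAN(eps=0.5, min_samples=5).fit(
    features, sample_weight=sample_weight_feature)
labels = db.labels_  # -1 marks noise, i.e. the outlier candidates
```

Here the isolated low-weight point is labelled -1 (noise), while the heavily weighted bins form a cluster even though there are only three of them.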

Answer

This is currently not implemented in ELKI, although it could easily be added by means of the GeneralizedDBSCAN class. Instead of counting the neighbors, you would sum their weights.

For this you need to modify the CorePredicate of GeneralizedDBSCAN to obtain a "WeightedCorePredicate". As long as you instantiate the objects from Java (and pass the relations directly to the classes), this should be fairly simple: you simply pass the relation of weights when instantiating your "WeightedCorePredicate". It only becomes difficult once you try to make it all available from the command line, where you have to specify the input format and how it selects the right relations and columns.

It's not trivial, though, to make this usable from the command line and the MiniGUI, as you will need a second relation for the weights. From Java code, it's fairly easy to do once you have understood the concept of using relations instead of arrays for everything. Roughly: for every neighbor you add up the weights from the weight relation and compare the sum to a threshold, instead of comparing the count to the "minpts" integer.

Since this has recently been requested by another user as well, I would appreciate a pull request contributing this to ELKI.

As for the goal of outlier detection, I suggest using a method designed for outlier detection instead. For example, Local Outlier Factor, or even a simple k-nearest-neighbor detector, should work fine and can be faster than DBSCAN. I am not convinced that your approach yields much benefit: with the help of index structures, DBSCAN is usually quite fast, and your de-duplication approach is likely already as expensive as DBSCAN with a similar grid-based data index.
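A minimal Local Outlier Factor run, as suggested above, using scikit-learn (the synthetic data here is made up for illustration; LOF is also available in ELKI):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 100 points in a tight Gaussian blob plus one far-away outlier.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 0.5, size=(100, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([inliers, outlier])

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
```

Unlike the DBSCAN-based detour, this directly scores each point by how much sparser its neighbourhood is than those of its neighbours, with no clustering step in between.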
