ELKI - input distance matrix


Question

I'm trying to use ELKI for outlier detection. I have a custom distance matrix and I'm trying to feed it to ELKI to run LOF (as a first attempt).

I tried to follow http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances but it is not very clear to me. What I do:

  • I don't want to load data from a database, so I use:

-dbc DBIDRangeDatabaseConnection -idgen.count 100

(where 100 is the number of objects I'll be analyzing)

  • I use the LOF algorithm and point it at the external distance file:

-algorithm outlier.LOF
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction
-distance.matrix testData.ascii -lof.k 3

My distance file is as follows (very simple, for testing purposes):

0 0 0  
0 1 1  
0 2 0.2  
0 3 0.1  
1 1 0  
1 2 0.9  
1 3 0.9  
2 2 0  
2 3 0.2  
3 3 0  
4 0 0.23  
4 1 0.97  
4 2 0.15  
4 3 0.07  
4 4 0  
5 0 0.1  
5 1 0.85  
5 2 0.02  
5 3 0.15  
5 4 0.1  
5 5 0  
6 0 1  
6 1 1   
6 2 1  
6 3 1  
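As a side check on the file format shown above, here is a minimal Python sketch that lists which ID pairs have no distance entry. The function name `missing_pairs` is hypothetical, and the sketch assumes one `i j distance` triple per line with unordered pairs. Whether ELKI's file-based distance function needs every pair to be present is not stated in this thread, so treat the result as a sanity check only. Note also that the file mentions only IDs 0 to 6, while the command line asks for 100 objects.

```python
from itertools import combinations_with_replacement

# The sample distance file from the question, one "i j distance" triple per line.
SAMPLE = """\
0 0 0
0 1 1
0 2 0.2
0 3 0.1
1 1 0
1 2 0.9
1 3 0.9
2 2 0
2 3 0.2
3 3 0
4 0 0.23
4 1 0.97
4 2 0.15
4 3 0.07
4 4 0
5 0 0.1
5 1 0.85
5 2 0.02
5 3 0.15
5 4 0.1
5 5 0
6 0 1
6 1 1
6 2 1
6 3 1
"""


def missing_pairs(lines, n):
    """Return unordered ID pairs (i, j) with i <= j < n that have no entry."""
    seen = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        i, j = int(parts[0]), int(parts[1])
        float(parts[2])  # raises ValueError if the distance is not numeric
        seen.add((min(i, j), max(i, j)))
    return [p for p in combinations_with_replacement(range(n), 2)
            if p not in seen]


print(missing_pairs(SAMPLE.splitlines(), 7))
# → [(4, 6), (5, 6), (6, 6)]
```

For the sample, the pairs involving object 6 with objects 4, 5 and 6 have no entry.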

The result says "all in one trivial clustering", although this is not clustering and there definitely are outliers in my data.

Am I doing this right? Or what am I missing?

Recommended answer

When using DBIDRangeDatabaseConnection and not giving ELKI any actual data, the visualization cannot produce a particularly useful result (it doesn't have the actual data, after all). Nor can the data be evaluated automatically.

The "all in one trivial clustering" is an artifact of the automatic attempts to visualize the data, which cannot work for the reasons discussed above. This clustering is added automatically for unlabeled data, to allow some visualizations to work.

There are two things to do:

  1. Set an output handler. For example -resulthandler ResultWriter, which will produce output similar to this:

ID=0 lof-outlier=1.0

where ID= is the object number and lof-outlier= is the LOF outlier score.

Alternatively, you can implement your own output handler. An example is found here: http://elki.dbs.ifi.lmu.de/browser/elki/trunk/src/tutorial/outlier/SimpleScoreDumper.java

  2. Fix DBIDRangeDatabaseConnection. You are, however, bitten by a bug in ELKI 0.6.0~beta1: the constructor of DBIDRangeDatabaseConnection does not initialize its parameters correctly. The trivial bug fix is here:

http://elki.dbs.ifi.lmu.de/changeset/11027/elki

Alternatively, you can create a dummy input file and use the regular text input. A file containing

0
1
2
...

should do the trick. Use -dbc.in numbers100.txt -dbc.filter FixedDBIDsFilter -dbc.startid 0. The last two arguments make your IDs start at 0 instead of 1 (the default).
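A dummy file like this can be generated rather than typed. The following is a minimal Python sketch, assuming one placeholder number per line as in the listing above; the function name `dummy_input_text` is hypothetical, and the file name `numbers100.txt` and the count of 100 are taken from this thread.

```python
from pathlib import Path


def dummy_input_text(count):
    """Return the dummy file content: the numbers 0 .. count - 1, one per line."""
    return "".join(f"{i}\n" for i in range(count))


if __name__ == "__main__":
    # 100 matches the number of objects mentioned in the question.
    Path("numbers100.txt").write_text(dummy_input_text(100))
```

The values themselves are placeholders; only the number of lines needs to match the number of objects covered by the distance file.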

This workaround will produce a slightly different output format:

ID=0 0.0 lof-outlier=1.0

where the additional column comes from the dummy file. The dummy values will not affect the LOF result when an external distance function is used, but this approach will use some additional memory.
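Either output format can be read back with a few lines of Python, for example to rank objects by score. This is only a sketch under assumptions: it expects lines shaped like the two examples in this answer, the helper names `parse_line` and `rank_outliers` are hypothetical, and the second and third input lines below are made-up values for illustration, not real ELKI output. Where ResultWriter puts its output is not covered here, so the sketch takes the lines as plain strings.

```python
def parse_line(line):
    """Parse 'ID=<n> [extra columns] lof-outlier=<score>' into (id, score).

    Returns None for lines that do not have this shape.
    """
    tokens = line.split()
    if not tokens or not tokens[0].startswith("ID="):
        return None
    prefix = "lof-outlier="
    for token in tokens[1:]:
        if token.startswith(prefix):
            return int(tokens[0][len("ID="):]), float(token[len(prefix):])
    return None


def rank_outliers(lines):
    """Return (id, score) pairs sorted by descending LOF score."""
    parsed = (parse_line(line) for line in lines)
    return sorted((p for p in parsed if p is not None),
                  key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    example = [
        "ID=0 0.0 lof-outlier=1.0",  # format shown in this answer
        "ID=1 1.0 lof-outlier=2.5",  # made-up value for illustration
        "ID=2 2.0 lof-outlier=0.9",  # made-up value for illustration
    ]
    for object_id, score in rank_outliers(example):
        print(object_id, score)
```

LOF scores close to 1 indicate objects about as dense as their neighbours; clearly larger scores point to outlier candidates, so sorting in descending order puts the most suspicious objects first.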
