How to use Isolation Forest


Question

I am trying to detect the outliers in my dataset, and I found sklearn's Isolation Forest. I can't understand how to work with it. I fit my training data to it, and it gives me back a vector of -1 and 1 values.

Can anyone explain to me how it works and provide an example?

How can I know that the outliers are 'real' outliers?

How do I tune the parameters?

Here is my code:

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=10000, random_state=10)
clf.fit(x_train)
y_pred_train = clf.predict(x_train)  # returns 1 for inliers, -1 for outliers
y_pred_test = clf.predict(x_test)

[1 1 1 ..., -1 1 1]

Accepted answer

It seems you have several questions; let me try to answer them one by one to the best of my knowledge.

How does it work? It works on the fact that outliers in any data set are by nature 'few and different', which is quite different from typical clustering-based or distance-based algorithms. At the top level, it works on the logic that outliers take fewer steps to 'isolate' than 'normal' points in the data set. To do so, this is what IF does: suppose you have a training data set X with n data points, each having m features. In training, IF builds isolation trees (binary trees built from random splits) on random subsets of the samples and features.

For training you have 3 parameters to tune: the number of isolation trees ('n_estimators' in sklearn's IsolationForest), the number of samples ('max_samples' in sklearn's IsolationForest), and the number of features to draw from X to train each base estimator ('max_features' in sklearn's IF). 'max_samples' is the number of random samples it will pick from the original data set for building each isolation tree.
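A rough sketch of that training step (the data, the parameter values, and the variable names below are made up for illustration; they are not from the question):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.normal(size=(1000, 5))   # synthetic data: n = 1000 points, m = 5 features

clf = IsolationForest(
    n_estimators=100,    # number of isolation trees
    max_samples=256,     # random samples drawn to build each tree
    max_features=1.0,    # fraction of features used per tree
    random_state=42,
)
clf.fit(X_train)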

During the test phase, it finds the path length of the data point under test in each of the trained isolation trees and takes the average path length. The higher the path length, the more normal the point, and vice versa. Based on the average path length it calculates an anomaly score; the decision_function of sklearn's IF can be used to get this. For sklearn's IF, the lower the score, the more anomalous the sample. Based on the anomaly score you can decide whether a given sample is anomalous or not by setting a proper value of contamination on the sklearn IF object. contamination is the amount of contamination of the data set, i.e. the expected proportion of outliers in it; the default value is 0.1 (newer scikit-learn versions default to 'auto'), and you can tune it to decide the threshold.
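Continuing the sketch above, scoring and labelling could look like this (X_test is again synthetic; the injected uniform points just stand in for outliers):

X_test = np.vstack([rng.normal(size=(95, 5)),           # mostly 'normal' points
                    rng.uniform(-6, 6, size=(5, 5))])   # a few injected outliers

scores = clf.decision_function(X_test)   # lower score = more anomalous
labels = clf.predict(X_test)             # 1 = inlier, -1 = outlier
print(labels)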

Tuning parameters:
Training -> 1. n_estimators, 2. max_samples, 3. max_features
Testing -> 1. contamination
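As a hedged illustration of the testing-side knob: fitting otherwise identical forests with different contamination values changes what fraction of points predict() flags as outliers (X_train is the synthetic data from the earlier sketch):

for c in (0.01, 0.1):
    clf_c = IsolationForest(n_estimators=100, max_samples=256,
                            contamination=c, random_state=42).fit(X_train)
    flagged = (clf_c.predict(X_train) == -1).mean()
    print("contamination=%s -> %.1f%% flagged as outliers" % (c, 100 * flagged))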
