How to use Isolation Forest


Problem description

I am trying to detect the outliers in my dataset, and I found sklearn's Isolation Forest. I can't understand how to work with it. I fit my training data to it, and it returns a vector of -1 and 1 values.

Can anyone explain to me how it works and provide an example?

How can I know that the outliers are 'real' outliers?

Should I tune the parameters?

Here is my code:

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=10000, random_state=10)
clf.fit(x_train)
y_pred_train = clf.predict(x_train)
y_pred_test = clf.predict(x_test)

[1 1 1 ..., -1 1 1]

Recommended answer

It seems you have several questions; let me try to answer them one by one to the best of my knowledge.

How does it work?

It works because the outliers in any data set are, by nature, few and different, which sets this method apart from typical clustering-based or distance-based algorithms. At the top level, it relies on the logic that outliers take fewer steps to 'isolate' than 'normal' points in the same data set. Concretely, this is what IF does: suppose you have a training data set X with n data points, each having m features. During training, IF builds Isolation Trees (binary search trees) over different features.

For training, you have 3 parameters to tune during the training phase:

  1. The number of Isolation Trees (n_estimators in sklearn's IsolationForest)
  2. The number of samples (max_samples in sklearn's IsolationForest)
  3. The number of features drawn from X to train each base estimator (max_features in sklearn's IsolationForest)

max_samples is the number of random samples picked from the original data set to build each Isolation Tree.
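As a minimal sketch of the training setup (the data and parameter values below are illustrative, not from the original question), the three tuning parameters map onto the IsolationForest constructor like this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Illustrative data: a dense 'normal' cluster plus a few obvious outliers.
x_normal = 0.3 * rng.randn(100, 2)
x_outliers = rng.uniform(low=-4, high=4, size=(5, 2))
x_train = np.vstack([x_normal, x_outliers])

clf = IsolationForest(
    n_estimators=100,  # number of Isolation Trees
    max_samples=64,    # random samples drawn to build each tree
    max_features=2,    # features drawn from X for each base estimator
    random_state=10,
)
clf.fit(x_train)
labels = clf.predict(x_train)  # 1 = inlier, -1 = outlier
```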

Testing phase:

  • sklearn_IF computes the path length of the data point under test in every trained Isolation Tree, then takes the average. The higher the average path length, the more 'normal' the point, and vice versa.

Based on the average path length, it computes an anomaly score; the decision_function of sklearn_IF can be used to obtain it. For sklearn_IF, the lower the score, the more anomalous the sample.
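To make the scoring concrete, here is a small sketch (with made-up data) showing that decision_function assigns a lower score to a point that is easier to isolate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
x_train = 0.3 * rng.randn(200, 2)            # dense 'normal' cluster near the origin
x_test = np.array([[0.0, 0.0], [4.0, 4.0]])  # a central point and a far-away point

clf = IsolationForest(random_state=0).fit(x_train)
scores = clf.decision_function(x_test)
# The far-away point has a shorter average path length, so it gets
# a lower anomaly score than the central point.
```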

Based on the anomaly score, you can decide whether a given sample is anomalous by setting a proper value of contamination on the sklearn_IF object. contamination is the expected proportion of outliers in the data set; its default value was 0.1 in older scikit-learn versions (newer versions default to 'auto'), and you can tune it to move the decision threshold.
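A short sketch (synthetic data, values chosen purely for illustration) of how contamination sets the threshold: predict flags roughly that fraction of the training points as -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
x_train = rng.randn(1000, 2)

# contamination is the expected proportion of outliers; the decision
# threshold is chosen so that about this fraction of training samples
# falls below it and is labeled -1 by predict().
clf = IsolationForest(contamination=0.05, random_state=0).fit(x_train)
frac_flagged = np.mean(clf.predict(x_train) == -1)
```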

Tuning the parameters

Training -> n_estimators, max_samples, max_features

Testing -> contamination

