Using GridSearchCV with IsolationForest for finding outliers

Problem description

I want to use IsolationForest to find outliers, and I want to find the best parameters for the model with GridSearchCV. The problem is that I always get the same error:

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
                max_features=1.0, max_samples='auto', n_estimators=100,
                n_jobs=None, random_state=None, verbose=0, warm_start=False) does not.

This seems to happen because IsolationForest does not have a score method. Is there a way to fix this? Also, is there a way to compute a score for an isolation forest? This is my code:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
                   'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
                   'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
                   'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})

x = df.iloc[:,:-1]

tuned = {'n_estimators':[70,80,100,120,150,200], 'max_samples':['auto', 1,3,5,7,10],
         'contamination':['legacy', 'outo'], 'max_features':[1,2,3,4,5,6,7,8,9,10,13,15],
         'bootstrap':[True,False], 'n_jobs':[None,1,2,3,4,5,6,7,8,10,15,20,25,30], 'behaviour':['old', 'new'],
         'random_state':[None,1,5,10,42], 'verbose':[0,1,2,3,4,5,6,7,8,9,10], 'warm_start':[True,False]}

isolation_forest = GridSearchCV(IsolationForest(), tuned)

model = isolation_forest.fit(x)

list_of_val = [[1,35,3], [3,4,5], [1,4,66], [4,6,1], [135,5,0]]
df['outliers'] = model.predict(x)
df['outliers'] = df['outliers'].map({-1: 'outlier', 1: 'good'})

print(model.best_params_)
print(df)
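
The cause named in the error message can be checked directly, without running the grid search. The snippet below is a minimal sketch, assuming a scikit-learn release in which IsolationForest provides score_samples; the variable names are illustrative only.

```python
from sklearn.ensemble import IsolationForest

forest = IsolationForest()

# GridSearchCV falls back to estimator.score when no `scoring` argument is given,
# so the absence of that method is what triggers the TypeError.
has_score = hasattr(forest, "score")

# score_samples is the per-sample anomaly score that IsolationForest does expose.
has_score_samples = hasattr(forest, "score_samples")

print("has score:", has_score)
print("has score_samples:", has_score_samples)
```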

Solution

IsolationForest has no built-in score method, so you need to supply your own scoring function. You can use the score_samples method that IsolationForest provides (it can be considered a proxy for score), wrap it in your own scorer as described here, and pass that scorer to GridSearchCV. I have modified your code to do this:

import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
                   'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
                   'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
                   'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})

x = df.iloc[:,:-1]

# 'behaviour' and contamination='legacy' match the older scikit-learn release
# shown in the question's error message.
tuned = {'n_estimators':[70,80], 'max_samples':['auto'],
         'contamination':['legacy'], 'max_features':[1],
         'bootstrap':[True], 'n_jobs':[None,1,2], 'behaviour':['old'],
         'random_state':[None,1], 'verbose':[0,1,2], 'warm_start':[True]}

def scorer_f(estimator, X):
    # Custom scorer: GridSearchCV calls it with the fitted estimator and the held-out fold
    return np.mean(estimator.score_samples(X))

# Alternatively, use an equivalent lambda expression:
# scorer = lambda est, data: np.mean(est.score_samples(data))

isolation_forest = GridSearchCV(IsolationForest(), tuned, scoring=scorer_f)
model = isolation_forest.fit(x)

SAMPLE OUTPUT

print(model.best_params_)

{'behaviour': 'old',
 'bootstrap': True,
 'contamination': 'legacy',
 'max_features': 1,
 'max_samples': 'auto',
 'n_estimators': 70,
 'n_jobs': None,
 'random_state': None,
 'verbose': 1,
 'warm_start': True}
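
Newer scikit-learn releases no longer accept the behaviour parameter or contamination='legacy', so the grid above may need adjusting there. The following is a sketch of the same approach under that assumption. The reduced parameter grid, the cv=3 setting, the fixed random_state and the name mean_score_samples are illustrative choices, not part of the original answer. It also labels the rows afterwards, as the question intended.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

# Feature columns from the question (the 'result' column is left out, as in x = df.iloc[:,:-1])
df = pd.DataFrame({
    'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
    'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
    'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
})
X = df[['first', 'second', 'third']]

# Illustrative grid: only parameters that change the fitted model
param_grid = {
    'n_estimators': [70, 80],
    'max_samples': ['auto', 10],
    'max_features': [1, 2, 3],
}

def mean_score_samples(estimator, X, y=None):
    # Proxy score: mean of score_samples on the held-out fold (higher means "more normal")
    return float(np.mean(estimator.score_samples(X)))

search = GridSearchCV(
    IsolationForest(random_state=0),  # fixed seed so repeated runs are comparable
    param_grid,
    scoring=mean_score_samples,
    cv=3,
)
search.fit(X)

# With the default refit=True, predict uses the best estimator refitted on all rows
labels = search.predict(X)
df['outliers'] = pd.Series(labels, index=df.index).map({-1: 'outlier', 1: 'good'})

print(search.best_params_)
print(df)
```

The mean of score_samples only ranks parameter settings by how "normal" the held-out rows look on average. It is a proxy, as the answer says, and not a measure of how well true outliers are detected.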

Hope this helps!
