Using GridSearchCV with IsolationForest for finding outliers
I want to use IsolationForest to find outliers, and I want to find the best parameters for the model with GridSearchCV. The problem is that I always get the same error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
max_features=1.0, max_samples='auto', n_estimators=100,
n_jobs=None, random_state=None, verbose=0, warm_start=False) does not.
This seems to happen because IsolationForest does not have a score method.
Is there a way to fix this?
Also, is there a way to get a score for an isolation forest?
This is my code:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
                   'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
                   'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
                   'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})
x = df.iloc[:,:-1]
tuned = {'n_estimators':[70,80,100,120,150,200], 'max_samples':['auto', 1,3,5,7,10],
         'contamination':['legacy', 'outo'], 'max_features':[1,2,3,4,5,6,7,8,9,10,13,15],
         'bootstrap':[True,False], 'n_jobs':[None,1,2,3,4,5,6,7,8,10,15,20,25,30], 'behaviour':['old', 'new'],
         'random_state':[None,1,5,10,42], 'verbose':[0,1,2,3,4,5,6,7,8,9,10], 'warm_start':[True,False]}
isolation_forest = GridSearchCV(IsolationForest(), tuned)
model = isolation_forest.fit(x)
list_of_val = [[1,35,3], [3,4,5], [1,4,66], [4,6,1], [135,5,0]]
df['outliers'] = model.predict(x)
df['outliers'] = df['outliers'].map({-1: 'outlier', 1: 'good'})
print(model.best_params_)
print(df)
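On the two points raised above (the missing score method, and whether an isolation forest exposes any score at all), here is a minimal sketch rather than a definitive recipe. The toy data, the variable names, and random_state=0 are assumptions made for illustration. It checks that a fitted IsolationForest has no score attribute, and shows how score_samples, decision_function, and predict relate to one another: decision_function is score_samples shifted by the fitted offset_, and predict returns -1 wherever decision_function is negative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical toy data: 50 points near the origin plus one distant point.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])

forest = IsolationForest(random_state=0).fit(X)

# GridSearchCV falls back on estimator.score when no scoring is given,
# and IsolationForest does not define that method.
has_score = hasattr(forest, "score")

# Per-sample anomaly scores: the lower the value, the more abnormal the sample.
scores = forest.score_samples(X)

# decision_function is score_samples shifted by the fitted offset_.
decision = forest.decision_function(X)
shift_matches = np.allclose(decision, scores - forest.offset_)

# predict labels a sample -1 (outlier) when its decision value is negative.
labels = forest.predict(X)
labels_match = np.array_equal(labels, np.where(decision < 0, -1, 1))

print(has_score, shift_matches, labels_match)
```

Because score_samples returns one value per row, it has to be reduced to a single number (for example a mean) before GridSearchCV can use it to rank parameter combinations.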
You need to create your own scoring function, since IsolationForest has no built-in score method. Instead, you can use the score_samples function that IsolationForest provides (it can be considered a proxy for score), create your own scorer (a callable that takes the estimator and the data), and pass it to GridSearchCV. I have modified your code to do this:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
                   'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
                   'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
                   'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})
x = df.iloc[:,:-1]
tuned = {'n_estimators':[70,80], 'max_samples':['auto'],
         'contamination':['legacy'], 'max_features':[1],
         'bootstrap':[True], 'n_jobs':[None,1,2], 'behaviour':['old'],
         'random_state':[None,1,], 'verbose':[0,1,2], 'warm_start':[True]}
def scorer_f(estimator, X):  # your own scorer
    return np.mean(estimator.score_samples(X))

# or you could use a lambda expression as shown below
# scorer = lambda est, data: np.mean(est.score_samples(data))
isolation_forest = GridSearchCV(IsolationForest(), tuned, scoring=scorer_f)
model = isolation_forest.fit(x)
SAMPLE OUTPUT
print(model.best_params_)
{'behaviour': 'old',
'bootstrap': True,
'contamination': 'legacy',
'max_features': 1,
'max_samples': 'auto',
'n_estimators': 70,
'n_jobs': None,
'random_state': None,
'verbose': 1,
'warm_start': True}
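For readers on a newer scikit-learn, here is a hedged variant of the same idea. As far as I know, recent releases no longer accept the behaviour parameter or contamination='legacy', so this sketch leaves them out. It also drops n_jobs and verbose from the grid, since they do not change the fitted model. The names param_grid, mean_score_samples, and search, as well as cv=3 and random_state=0, are my own choices and not part of the original answer. Keep in mind that the mean of score_samples is only a heuristic selection criterion, not a measure of how well outliers are detected.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame({
    "first": [-112, 0, 1, 28, 5, 6, 3, 5, 4, 2, 7, 5, 1, 3, 2, 2, 5, 2, 42, 84, 13, 43, 13],
    "second": [42, 1, 2, 85, 2, 4, 6, 8, 3, 5, 7, 3, 64, 1, 4, 1, 2, 4, 13, 1, 0, 40, 9],
    "third": [3, 4, 7, 74, 3, 8, 2, 4, 7, 1, 53, 6, 5, 5, 59, 0, 5, 12, 65, 4, 3, 4, 11],
})
X = df[["first", "second", "third"]]

# Only parameters that affect the fitted model; max_features is capped
# at the number of columns (3).
param_grid = {
    "n_estimators": [70, 80],
    "max_samples": ["auto"],
    "contamination": ["auto"],
    "max_features": [1, 2, 3],
    "bootstrap": [True, False],
}

def mean_score_samples(estimator, X, y=None):
    """Custom scorer: average anomaly score on the held-out fold (higher = more 'normal')."""
    return float(np.mean(estimator.score_samples(X)))

search = GridSearchCV(
    IsolationForest(random_state=0),  # fixed seed so repeated runs agree
    param_grid,
    scoring=mean_score_samples,
    cv=3,
)
search.fit(X)

# With the default refit=True, predict uses the best estimator refit on all rows.
labels = search.predict(X)
df["outliers"] = pd.Series(labels, index=df.index).map({-1: "outlier", 1: "good"})

print(search.best_params_)
print(df)
```

The scorer accepts an optional y so that it works whether or not labels are passed to fit; with unlabeled data GridSearchCV calls it with just the estimator and the held-out rows.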
Hope this helps!