Classifiers in scikit-learn that handle nan/null
Question
I was wondering if there are classifiers in scikit-learn that handle nan/null values. I thought the random forest regressor handled this, but I got an error when I called predict.
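import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
# Fit explicitly: the constructor takes hyperparameters, not data
clf = RandomForestRegressor().fit(X_train, y_train)
X_test = np.array([[7, 8, np.nan]])  # predict expects a 2-D array
y_pred = clf.predict(X_test)  # the call I reported as failing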
Is there no scikit-learn algorithm that lets me call predict on data with missing values?
Edit: Now that I think about it, this makes sense. It's not an issue during training, but at prediction time, how do you branch when the variable is null? Maybe you could split both ways and average the results? It seems that k-NN should work fine as long as the distance function ignores nulls, though.
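The k-NN idea in this edit can be sketched with `nan_euclidean_distances` from `sklearn.metrics.pairwise`, which computes distances from the coordinates present in both rows only. This is a minimal 1-nearest-neighbour illustration, not a full classifier; the helper `predict_1nn` and the toy data are assumptions made up for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def predict_1nn(X_train, y_train, X_query):
    """Label each query row with the label of its nearest training row."""
    # Distances use only the coordinates present in both rows, rescaled
    # to compensate for the coordinates that were skipped.
    dist = nan_euclidean_distances(X_query, X_train)
    # nanargmin skips pairs with no shared coordinates (distance is NaN);
    # it raises if a query row shares no coordinates with any training row.
    nearest = np.nanargmin(dist, axis=1)
    return y_train[nearest]


# Hypothetical toy data: every row has at least one missing value.
X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6], [8, 8, np.nan]])
y_train = np.array([0, 0, 1])
X_query = np.array([[7, 8, np.nan]])

y_pred = predict_1nn(X_train, y_train, X_query)
print(y_pred)
```

Here the query is closest to the third training row, because that row shares two known coordinates with it and both are near.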
Edit 2 (older and wiser me): Some GBM libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: two children for the yes/no decision and one child for the missing-value decision. sklearn uses a binary tree.
Recommended answer
I made an example that contains missing values in both the training and the test sets.
I just picked one strategy, replacing missing data with the mean, using the SimpleImputer class. There are other strategies.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
X_test_1 = [0, 0, np.nan]
X_test_2 = [0, np.nan, np.nan]
X_test_3 = [np.nan, 1, 1]

# Create our imputer to replace missing values with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)

# Impute our data, then train
X_train_imp = imp.transform(X_train)
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_imp, Y_train)

for X_test in [X_test_1, X_test_2, X_test_3]:
    # Impute each test item (wrapped in a list to make it 2-D), then predict
    X_test_imp = imp.transform([X_test])
    print(X_test, '->', clf.predict(X_test_imp))
# Results
[0, 0, nan] -> [0]
[0, nan, nan] -> [0]
[nan, 1, 1] -> [1]
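The answer mentions that other strategies exist. As a rough sketch, the snippet below compares the fill values that `SimpleImputer` learns for the "mean", "median" and "most_frequent" strategies, shows the "constant" strategy, and chains an imputer with the classifier through `make_pipeline` so that the same imputation is applied at fit and predict time. The extra training rows, the fill value of -1 and the choice of the median are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data (extended from the answer's example for illustration).
X_train = np.array([[0, 0, np.nan], [np.nan, 1, 1], [0, 0, 0], [1, 1, 1]])
y_train = np.array([0, 1, 0, 1])

# Per-column fill values learned by each strategy.
fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imp = SimpleImputer(strategy=strategy).fit(X_train)
    fill_values[strategy] = imp.statistics_
    print(strategy, imp.statistics_)

# The "constant" strategy fills every missing entry with fill_value.
X_const = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X_train)

# A pipeline imputes and classifies in one object, so raw data with NaN
# can be passed to both fit and predict.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X_train, y_train)

X_test = np.array([[0, np.nan, np.nan], [np.nan, 1, 1]])
pred = model.predict(X_test)
print(pred)
```

Using a pipeline also keeps the imputer's statistics tied to the training data, which matters when the model is cross-validated.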