Classifiers in scikit-learn that handle nan/null


Question

I was wondering whether scikit-learn has any classifiers that handle nan/null values. I thought the random forest regressor handled this, but I get an error when I call predict.

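import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
clf = RandomForestRegressor().fit(X_train, y_train)
X_test = np.array([[7, 8, np.nan]])  # predict expects a 2D array
y_pred = clf.predict(X_test)  # Fails!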

Is it not possible to call predict with any scikit-learn algorithm when the data contains missing values?

Edit: Now that I think about it, this makes sense. Missing values are not an issue during training, but at prediction time, how do you branch when the variable is null? Maybe you could follow both branches and average the results? k-NN seems like it should work fine, though, as long as the distance function ignores nulls.
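The k-NN idea from the edit above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: `nan_aware_distance` and `knn_predict` are hypothetical helper names, and the rescaling by the fraction of shared coordinates is one reasonable convention (scikit-learn ships a similar metric as `sklearn.metrics.pairwise.nan_euclidean_distances`).

```python
import numpy as np


def nan_aware_distance(a, b):
    """Euclidean distance over coordinates present in both vectors,
    rescaled to compensate for the coordinates that were skipped."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf  # no coordinate in common, so nothing to compare
    squared = np.sum((a[shared] - b[shared]) ** 2)
    return np.sqrt(squared * a.size / shared.sum())


def knn_predict(X_train, y_train, x, k=1):
    """Majority vote among the k training rows closest to x."""
    dists = np.array([nan_aware_distance(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(y_train)[nearest], return_counts=True)
    return labels[np.argmax(counts)]


# Same data as in the question
X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
pred = knn_predict(X_train, y_train, [7, 8, np.nan], k=1)
print(pred)
```

With the question's data, each training row shares only one coordinate with the test point, so the prediction rests on very little information; the sketch only shows that the distance can be computed without imputing first.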

Edit 2 (older and wiser me): Some gbm libraries (such as xgboost) handle missing values inside the tree itself. Each split still has two children for the yes/no decision, and it also learns a default direction that rows with a missing value follow. sklearn's binary trees had no such handling when this was written.

Recommended answer

I made an example that contains missing values in both the training and the test sets.

I just picked one strategy, replacing missing data with the mean, using the SimpleImputer class. There are other strategies.
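As a quick aside on those other strategies, before the full example: `SimpleImputer` also accepts `"median"`, `"most_frequent"`, and `"constant"` for its `strategy` parameter. The small matrix and the `fill_value=-1` below are arbitrary choices for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[0, 0, np.nan], [np.nan, 1, 1], [4, 1, 3]], dtype=float)

filled = {}
# Statistics are computed per column, ignoring the missing entries
for strategy in ("mean", "median", "most_frequent"):
    filled[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# "constant" ignores the data and writes fill_value into every gap
filled["constant"] = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

for name, values in filled.items():
    print(name, values.tolist())
```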

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer


X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
X_test_1 = [0, 0, np.nan]
X_test_2 = [0, np.nan, np.nan]
X_test_3 = [np.nan, 1, 1]

# Create our imputer to replace missing values with, e.g., the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)

# Impute our data, then train
X_train_imp = imp.transform(X_train)
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_imp, Y_train)

for X_test in [X_test_1, X_test_2, X_test_3]:
    # Impute each test item (wrapped in a list, since transform expects 2D input), then predict
    X_test_imp = imp.transform([X_test])
    print(X_test, '->', clf.predict(X_test_imp))

# Results
[0, 0, nan] -> [0]
[0, nan, nan] -> [0]
[nan, 1, 1] -> [1]
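The impute-then-predict steps above can also be bundled so that the imputer fitted on the training data is applied automatically at prediction time. This is a sketch using `make_pipeline` from `sklearn.pipeline`, which is not part of the original answer; `random_state=0` is added only to make runs repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]

# The pipeline fits the imputer on the training data, then trains the forest
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X_train, Y_train)

# predict() reuses the training-set means to fill gaps in new rows
X_test = [[0, 0, np.nan], [0, np.nan, np.nan], [np.nan, 1, 1]]
preds = model.predict(X_test)
print(preds)
```

Keeping both steps in one object avoids accidentally fitting a second imputer on the test data, which would fill the gaps with different values than the ones the classifier was trained on.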
