ValueError: unknown is not supported in sklearn.RFECV
Question
I am trying to narrow down the features that are really relevant to my classifier using RFECV. This is the code I have written:
import sklearn
import pandas as p
import numpy as np
import scipy as sp
import pylab as pl
from sklearn import linear_model, cross_validation, metrics
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.metrics import zero_one_loss
from sklearn import preprocessing
#from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.feature_selection import SelectKBest, chi2
modelType = "notext"
# ----------------------------------------------------------
# Prepare the Data
# ----------------------------------------------------------
training_data = np.array(p.read_table('F:/NYC/NYU/SM/3/SNLP/Project/Data/train.tsv'))
print ("Read Data\n")
# get the target variable and set it as Y so we can predict it
Y = training_data[:,-1]
print(Y)
# not all data is numerical, so we'll have to convert those fields
# fix "is_news":
training_data[:,17] = [0 if x == "?" else 1 for x in training_data[:,17]]
# fix -1 entries in hasDomainLink
training_data[:,14] = [0 if x =="-1" else x for x in training_data[:,14]]
# fix "news_front_page":
training_data[:,20] = [999 if x == "?" else x for x in training_data[:,20]]
training_data[:,20] = [1 if x == "1" else x for x in training_data[:,20]]
training_data[:,20] = [0 if x == "0" else x for x in training_data[:,20]]
# fix "alchemy category":
training_data[:,3] = [0 if x=="arts_entertainment" else x for x in training_data[:,3]]
training_data[:,3] = [1 if x=="business" else x for x in training_data[:,3]]
training_data[:,3] = [2 if x=="computer_internet" else x for x in training_data[:,3]]
training_data[:,3] = [3 if x=="culture_politics" else x for x in training_data[:,3]]
training_data[:,3] = [4 if x=="gaming" else x for x in training_data[:,3]]
training_data[:,3] = [5 if x=="health" else x for x in training_data[:,3]]
training_data[:,3] = [6 if x=="law_crime" else x for x in training_data[:,3]]
training_data[:,3] = [7 if x=="recreation" else x for x in training_data[:,3]]
training_data[:,3] = [8 if x=="religion" else x for x in training_data[:,3]]
training_data[:,3] = [9 if x=="science_technology" else x for x in training_data[:,3]]
training_data[:,3] = [10 if x=="sports" else x for x in training_data[:,3]]
training_data[:,3] = [11 if x=="unknown" else x for x in training_data[:,3]]
training_data[:,3] = [12 if x=="weather" else x for x in training_data[:,3]]
training_data[:,3] = [999 if x=="?" else x for x in training_data[:,3]]
print ("Corrected outliers data\n")
# ----------------------------------------------------------
# Models
# ----------------------------------------------------------
if modelType == "notext":
    print ("no text model\n")
    #ignore features which are useless
    X = training_data[:,list([3, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 19, 20, 22, 25])]
    scaler = preprocessing.StandardScaler()
    print("initialized scaler \n")
    scaler.fit(X,Y)
    print("fitted train data and labels\n")
    X = scaler.transform(X)
    print("Transformed train data\n")
    svc = SVC(kernel = "linear")
    print("Initialized SVM\n")
    rfecv = RFECV(estimator = svc, cv = 5, loss_func = zero_one_loss, verbose = 1)
    print("Initialized RFECV\n")
    rfecv.fit(X,Y)
    print("Fitted train data and label\n")
    rfecv.support_
    print ("Optimal Number of features : %d" % rfecv.n_features_)
    # savetxt lives in numpy, which is imported as np above
    np.savetxt('rfecv.csv', rfecv.ranking_, delimiter=',', fmt='%f')
When rfecv.fit(X,Y) is called, my code raises an error from the metrics.py file: "ValueError: unknown is not supported".
The error is raised in sklearn.metrics.metrics:
# No metrics support "multiclass-multioutput" format
if (y_type not in ["binary", "multiclass", "multilabel-indicator", "multilabel-sequences"]):
    raise ValueError("{0} is not supported".format(y_type))
This is a classification problem, and the target values are only 0 or 1. The data set can be found at Kaggle Competition Data.
If anyone can point out where I am going wrong, I would appreciate it.
Recommended answer
RFECV checks that the target/training data is of one of the types binary, multiclass, multilabel-indicator or multilabel-sequences:
- 'binary': y contains <= 2 discrete values and is 1d or a column vector.
- 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
- 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
- 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
Your Y, however, is of type unknown, that is:
- 'unknown': y is array-like but none of the above, such as a 3d array, or an array of non-sequence objects.
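The categories above can be checked directly. The following is a minimal sketch, assuming a recent scikit-learn in which type_of_target can be imported from sklearn.utils.multiclass; the sample arrays and their dictionary keys are made up for illustration and do not come from the Kaggle data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Illustrative targets (hypothetical values, not from the Kaggle data).
examples = {
    "two labels, 1d": np.array([0, 1, 1, 0]),
    "three labels, 1d": np.array([0, 1, 2, 1]),
    "2d, more than two values": np.array([[1, 2], [3, 1]]),
    "2d indicator matrix": np.array([[0, 1], [1, 1]]),
    "object array of Python ints": np.array([0, 1, 1, 0], dtype=object),
}

# Ask scikit-learn how it classifies each target.
detected = {name: type_of_target(y) for name, y in examples.items()}
for name, kind in detected.items():
    print(f"{name}: {kind}")
```

The last entry holds the same values as the first one; only the dtype differs, which is enough to change the detected type.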
The reason is that your target data is string (of the form "0" and "1") and is loaded with read_table as object:
>>> from sklearn.utils.multiclass import type_of_target
>>> training_data[:, -1].dtype
dtype('O')
>>> type_of_target(training_data[:, -1])
'unknown'
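One way to see where the object dtype comes from: turning a DataFrame with mixed column types into a single NumPy array forces one common dtype on every column, including the label column. This is a small sketch with a made-up three-column table; the column names and values are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical tab-separated sample; "?" marks a missing value, as in the question.
tsv = "is_news\tscore\tlabel\n1\t0.5\t0\n?\t0.7\t1\n1\t0.1\t1\n"
df = pd.read_table(io.StringIO(tsv))

# A single array for the whole frame needs one common dtype: object.
as_array = np.array(df)
print(as_array.dtype)          # object
print(as_array[:, -1].dtype)   # object, even though the labels are numeric

# Taking the label column on its own keeps its integer dtype.
labels = df["label"].to_numpy()
print(labels.dtype.kind)       # i
```

Reading the label from the DataFrame column rather than from the combined array is an alternative to casting afterwards.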
To solve the issue, convert the target to int:
>>> Y = training_data[:, -1].astype(int)
>>> type_of_target(Y)
'binary'
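Putting the fix together, here is a sketch of the whole flow on synthetic data. It assumes a recent scikit-learn, where RFECV takes a scoring argument in place of the loss_func argument used in the question. The data from make_classification, the sample and feature counts, and the random seed are all stand-ins for the Kaggle set.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils.multiclass import type_of_target

# Synthetic stand-in for the training data (not the Kaggle set).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# Reproduce the problem: labels stored as generic Python objects.
y_obj = y.astype(object)
print(type_of_target(y_obj))   # unknown

# The fix from the answer: cast the labels to int before fitting.
Y = y_obj.astype(int)
print(type_of_target(Y))       # binary

# Scale the features, then run recursive feature elimination with 5-fold CV.
X_scaled = StandardScaler().fit_transform(X)
rfecv = RFECV(estimator=SVC(kernel="linear"), cv=5, scoring="accuracy")
rfecv.fit(X_scaled, Y)

print("Optimal number of features: %d" % rfecv.n_features_)
print(rfecv.ranking_)
```

The scaler is fitted on X alone here, since StandardScaler does not use the labels.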