Feature selection using logistic regression


Question

I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using logistic regression. If I test the model on held-out data, the accuracy is just above 60%. The response variable is evenly distributed across its classes. My question is: if the model's performance is not good, can I treat the features it gives me as genuinely important features? Or should I try to improve the accuracy of the model, even though my end goal is not better accuracy but only a list of important features?
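For concreteness, the setup described in the question can be sketched roughly as follows. The data here is synthetic (generated with make_classification), and the sample size, the scaling step, and all variable names are assumptions made for illustration, not the asker's actual code. Ranking features by the absolute value of their coefficients on standardized inputs is only one common way to read "the features the model gives".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data set (assumption: 20 features, 5 informative).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

# Standardize so that coefficient magnitudes are comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Accuracy on data the model has never seen.
held_out_accuracy = model.score(X_test, y_test)

# Rank feature indices from largest to smallest absolute coefficient.
coefs = model.named_steps['logisticregression'].coef_.ravel()
ranking = np.argsort(-np.abs(coefs))

print('held-out accuracy:', held_out_accuracy)
print('top 5 feature indices:', ranking[:5])
```

Looking at the held-out accuracy and the ranking side by side makes it easier to judge how much weight the ranking deserves.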

Recommended answer

sklearn's GridSearchCV has some pretty neat methods that help you arrive at the best feature set. For example, consider the following code:

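from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Two named steps: 'vect' turns text into TF-IDF features, 'clf' classifies them.
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])

# Candidate values to try, keyed as '<step name>__<parameter name>'.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30)
}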

Here the parameters dictionary holds all of the different parameter values that I want to consider. Notice the use of vect__max_df. max_df is an actual parameter of my vectorizer, which acts as my feature selector. So,

'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),

actually specifies that I want to try out the above 5 values of max_df for my vectorizer. The same goes for the other entries. Notice how I have tied my vectorizer to the key 'vect' and my classifier to the key 'clf'. Can you see the pattern?
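The naming pattern can be checked directly on a pipeline. The following is a minimal sketch that only builds the pipeline and inspects its parameters; the variable name param_names is introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression()),
])

# Every tunable setting is exposed as '<step name>__<parameter name>'.
param_names = sorted(pipeline.get_params().keys())
print([p for p in param_names if p.startswith('vect__max')])

# The same naming scheme works for setting values by hand.
pipeline.set_params(vect__max_df=0.5, clf__C=10)
print(pipeline.named_steps['vect'].max_df, pipeline.named_steps['clf'].C)
```

GridSearchCV relies on the same set_params mechanism when it tries each combination from the grid.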

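import re

import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

traindf = pd.read_json('../../data/train.json')

# Comma-separated ingredients per recipe (not used further below).
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]

# Strip non-letters from each ingredient, lemmatize it, and join into one string per recipe.
lemmatizer = WordNetLemmatizer()
traindf['ingredients_string'] = [' '.join([lemmatizer.lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]

X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

# Try every parameter combination with cross-validation and keep the best one.
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('best score: %0.3f' % grid_search.best_score_)
print('best parameters set:')

bestParameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print('\t %s: %r' % (param_name, bestParameters[param_name]))

# Evaluate the refitted best pipeline on the held-out split.
predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))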

Note that the bestParameters dictionary gives me the best set of parameters out of all the options that I listed in the parameter grid.
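To see what the search returns without the recipe data, here is a small self-contained sketch. The toy corpus, its labels, the reduced grid, and cv=2 are all assumptions chosen so that it runs in a moment; best_params_ is the attribute GridSearchCV provides for reading the winning combination directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = [
    'soy sauce ginger rice', 'rice soy sauce sesame',
    'ginger sesame soy noodles', 'rice noodles soy ginger',
    'olive oil basil tomato', 'tomato basil pasta garlic',
    'olive oil garlic pasta', 'basil garlic tomato olive',
]
labels = ['asian'] * 4 + ['italian'] * 4

pipeline = Pipeline([('vect', TfidfVectorizer()), ('clf', LogisticRegression())])
parameters = {'vect__ngram_range': ((1, 1), (1, 2)), 'clf__C': (0.1, 1, 10)}

# 2 x 3 = 6 combinations, each scored with 2-fold cross-validation.
grid_search = GridSearchCV(pipeline, parameters, cv=2, scoring='accuracy')
grid_search.fit(docs, labels)

# best_params_ contains only the keys that were part of the grid.
print(grid_search.best_params_)
print('combinations tried:', len(grid_search.cv_results_['params']))
```

Unlike best_estimator_.get_params(), which returns every parameter of the pipeline, best_params_ is limited to the tuned ones, so no filtering loop is needed.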

Hope this helps.

Getting the list of selected features

So once you have your best set of parameters, create the vectorizer and the classifier with those parameter values:

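best = grid_search.best_params_  # winning values, keyed like 'vect__max_df'
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True, max_df=best['vect__max_df'], ngram_range=best['vect__ngram_range'], use_idf=best['vect__use_idf'])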

Then you basically train this vectorizer again. In doing so, the vectorizer keeps only certain features from your training set.

traindf = pd.read_json('../../data/train.json')

traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]

traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]

X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

# Refit the vectorizer on the training text; TfidfVectorizer accepts y but ignores it.
termDocMatrix = vect.fit_transform(X_train, y_train)

Now termDocMatrix contains the features that the vectorizer kept, and you can use the vectorizer to get the feature names. Let's say you want the top 100 features, with the chi-square score as your metric for comparison. The selector has to be fitted on the matrix and the training labels before it can report anything:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

getKbest = SelectKBest(chi2, k=100).fit(termDocMatrix, y_train)

Now

print(np.asarray(vect.get_feature_names_out())[getKbest.get_support()])

should give you the top 100 features (on scikit-learn versions before 1.0, use get_feature_names() instead). Try this.
