Sklearn - Return top 3 classes from Logistic Regression

Problem Description

I am trying to create a model that will categorize customer emails into categories ("Case Reasons"). I have cleaned up stop words, etc., and have tested a few different models; Logistic Regression is the most accurate. The issue is that it is only accurate about 70% of the time. This is largely because of class imbalance in the data (there are a handful of case reasons that receive the majority of the emails).
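A quick way to confirm that imbalance, assuming the same df frame and 'Reason' column used later in the question:

# Share of emails per case reason, largest first
print(df['Reason'].value_counts(normalize=True))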

Instead of just predicting a single outcome, I would like to try giving the agents the top 3 (or perhaps 5) to choose from.

Here is what I have so far:

from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1',
                        ngram_range=(1, 2), stop_words=internal_stop_words)

features = tfidf.fit_transform(df.Description).toarray()
labels = df.category_id
features.shape

After I vectorized everything, I ran it through the following block to test which of 4 models provided the best fit. This is what showed that Logistic Regression was at 70% and the best of the four:

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import pandas as pd

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
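A compact way to compare the four models from this frame is to average the fold scores; a minimal sketch using the cv_df built above:

# Mean cross-validated accuracy per model, best first
print(cv_df.groupby('model_name')['accuracy'].mean().sort_values(ascending=False))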

I created the classifier, and it works when passing values through it:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(df['Description'], df['Reason'],
                                                    random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = LogisticRegression(solver='saga', multi_class='multinomial').fit(X_train_tfidf, y_train)

# Note: this passes raw counts to a model trained on tf-idf features;
# tfidf_transformer.transform should really be applied here as well
print(clf.predict(count_vect.transform(["i dont know my password"])))

['Reason #1']

In this case, this isn't the correct reason. I can run the following to get a table that shows the probabilities of each classification:

# Test logistic regression probabilities
probs = clf.predict_proba(count_vect.transform(["I dont know my password"]))

# One row per class: the label and its predicted probability
output = pd.DataFrame({'reason': clf.classes_, 'prob': probs.ravel()})
output.sort_values(by='prob', ascending=False)

Which returns:

index       reason        prob
7           Reason #7     0.6036937161535804
6           Reason #6     0.1576980112870697
3           Reason #3     0.13221805369421305
13          Reason #13    0.028848040868062686
8           Reason #8     0.02264491874676607
9           Reason #9     0.01725043255540921
0           Reason #0     0.01600640516713904
10          Reason #10    0.005444588928021622
4           Reason #4     0.0052240828713529894
5           Reason #5     0.0048409867159243045
2           Reason #2     0.0024794864823573935
1           Reason #1     0.0014065266971805264
11          Reason #11    0.001393613395266496
12          Reason #12    0.0008511364376563769

So I'm sorting by the most likely Reasons, and in this case #3 is the correct answer.
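For a single email, the top 3 rows of that frame can be pulled directly; a minimal sketch using the output frame built above:

# Keep only the 3 most probable reasons for this email
print(output.nlargest(3, 'prob'))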

How can I return the top N results for an input, and also test the model's accuracy in terms of the actual reason being present in one of those N results?

Recommended Answer

You can sort your probabilities in descending order and retrieve the top n. To calculate accuracy, you can define a custom function that considers a prediction correct if y_true is in the top n. Something along these lines should work:

import numpy as np

n = 3  # how many of the top predictions to consider

# X_test is raw text, so apply the same count + tf-idf transforms used in training
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(X_test))
probs = clf.predict_proba(X_test_tfidf)

# Sort descending and keep only the top-n column indices per row
top_n = np.argsort(probs)[:, :-n-1:-1]
# Map column indices back to class labels so they compare against y_test
top_n_labels = clf.classes_[top_n]

# Top-n accuracy: a prediction is correct if the true label is in the top n
true_preds = 0
for i, true_label in enumerate(y_test):
    if true_label in top_n_labels[i]:
        true_preds += 1

accuracy = true_preds / len(y_test)
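As a cross-check, newer scikit-learn releases (0.24+) ship a built-in metric for exactly this; a minimal sketch reusing probs and y_test from above:

from sklearn.metrics import top_k_accuracy_score

# labels= maps the probability columns to class names when y_true holds string labels
accuracy = top_k_accuracy_score(y_test, probs, k=3, labels=clf.classes_)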
