Testing text classification ML model with new data fails

Views: 21 Published: 2021/12/25 14:49:36 Tags: python machine-learning scikit-learn nlp text-processing


Problem description

I have built a machine learning model to classify emails as spam or not spam. Now I want to test it on one of my own emails and see the result, so I wrote the following code to classify the new email:

message = """Subject: Hello this is from google security team we want to recover your password. Please contact us 
as soon as possible"""

message = pd.Series([message,])
transformed_message = CountVectorizer(analyzer=process_text).fit_transform(message)
proba = model.predict_proba(transformed_message)[0]

Here process_text is a function that preprocesses the email. When I run the code, I get the following error:

Number of features of the model must match the input. Model n_features is 37229 and input n_features is 13

What is the problem, and how can I fix it?

Solution

The preprocessing steps in such a pipeline are fitted once, on the training data. They are never fitted again on new data, which is what you are doing here with your newly defined count vectorizer.

So, instead of calling fit_transform on a new count vectorizer, you should reuse the existing count vectorizer (i.e. the one that was fitted on your training data) and apply its transform method. Your new data will then be mapped onto the same 37229 features the model was trained with, instead of the mere 13 features you get when a fresh count vectorizer is fitted on such a short text.
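The difference can be illustrated with a minimal sketch. The training messages, the labels, the body of process_text and the choice of MultinomialNB are all assumptions made up for this example, since the question does not show them; only the fit-once, transform-afterwards pattern is the point.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def process_text(text):
    # Hypothetical stand-in for the asker's process_text:
    # lowercase the message and split it on whitespace.
    return text.lower().split()


# Tiny made-up training set (1 = spam, 0 = not spam), for illustration only.
train_messages = [
    "Subject: win a free prize now",
    "Subject: meeting notes for monday",
    "Subject: recover your password, contact us now",
    "Subject: lunch on friday",
]
train_labels = [1, 0, 1, 0]

# Fit the vectorizer once, on the training data, and keep the fitted object.
vectorizer = CountVectorizer(analyzer=process_text)
X_train = vectorizer.fit_transform(train_messages)

# The classifier type is an assumption; the question does not name the model.
model = MultinomialNB().fit(X_train, train_labels)

message = (
    "Subject: Hello this is from google security team "
    "we want to recover your password."
)

# Wrong: a new vectorizer fitted on one message builds its own small vocabulary,
# so its column count does not match what the model was trained on.
refit = CountVectorizer(analyzer=process_text).fit_transform([message])

# Right: only transform, using the vectorizer fitted on the training data.
transformed_message = vectorizer.transform([message])

print("training features:", X_train.shape[1])
print("refitted features:", refit.shape[1])
print("transformed features:", transformed_message.shape[1])

proba = model.predict_proba(transformed_message)[0]
print("class probabilities:", proba)
```

In a real project the fitted vectorizer has to be kept alongside the model, for example by saving both objects after training, so that the same vocabulary is available at prediction time.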
