Computing TF-IDF on the whole dataset or only on training data?

Question

In chapter seven of the book "TensorFlow Machine Learning Cookbook", the author uses scikit-learn's fit_transform function during pre-processing to get the tfidf features of the text for training. The author passes all of the text data to the function before separating it into train and test sets. Is this correct, or must we split the data first and then perform fit_transform on the train set and transform on the test set?

Recommended answer

According to the scikit-learn documentation, fit() is used in order to

Learn vocabulary and idf from training set.

On the other hand, fit_transform() is used in order to

Learn vocabulary and idf, return term-document matrix.

transform()

Transforms documents to document-term matrix.
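As a rough illustration of the three methods quoted above, the sketch below checks that calling fit() followed by transform() gives the same result as a single fit_transform() call. It assumes scikit-learn's TfidfVectorizer with default settings; the tiny corpus is made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# fit() only learns the vocabulary and idf weights; it returns the vectorizer.
two_step = TfidfVectorizer().fit(docs)
# transform() uses what was learned to build the document-term matrix.
X_two_step = two_step.transform(docs)

# fit_transform() does both in a single call.
one_step = TfidfVectorizer()
X_one_step = one_step.fit_transform(docs)

# Compare the learned vocabularies and the resulting matrices.
same_vocabulary = two_step.vocabulary_ == one_step.vocabulary_
same_matrix = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print(same_vocabulary, same_matrix)
```

Each row of the resulting matrix corresponds to one document and each column to one term in the learned vocabulary.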

On the training set you need to apply both fit() and transform() (or just fit_transform(), which essentially joins both operations). On the testing set, however, you only need to transform() the testing instances (i.e. the documents).
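A minimal sketch of that workflow, assuming scikit-learn's TfidfVectorizer and train_test_split; the texts, labels, and variable names are invented for illustration and are not taken from the book.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels, made up for illustration.
texts = [
    "free prize waiting for you",
    "meeting moved to friday",
    "claim your free reward now",
    "lunch at noon tomorrow",
    "win a free holiday today",
    "project report attached",
]
labels = [1, 0, 1, 0, 1, 0]

# 1. Split the raw text first, before any feature extraction.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=0
)

# 2. Learn vocabulary and idf from the training texts only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)

# 3. Reuse the fitted vectorizer on the test texts; no refitting.
X_test = vectorizer.transform(texts_test)

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns, because the test documents are projected onto the vocabulary learned from the training documents.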

Remember that the training set is used for learning (learning is achieved through fit()), while the testing set is used to evaluate whether the trained model can generalise well to new, unseen data points. Fitting the vectorizer on all of the text before splitting therefore lets the test documents influence the learned vocabulary and idf weights.
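To make the effect of fitting on everything concrete, here is a small sketch that compares a vectorizer fitted on the training texts only with one fitted on training plus test texts. It assumes scikit-learn's TfidfVectorizer with default settings; the sentences are invented, and how much this matters in practice will depend on the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, made up for illustration.
train_texts = ["the cat sat", "the dog barked", "the cat purred"]
test_texts = ["the parrot squawked"]

fit_on_train = TfidfVectorizer().fit(train_texts)
fit_on_all = TfidfVectorizer().fit(train_texts + test_texts)

# Terms that enter the vocabulary only because the test text was seen.
leaked_terms = set(fit_on_all.vocabulary_) - set(fit_on_train.vocabulary_)
print(sorted(leaked_terms))

# The idf weight of a training term also shifts once the test text is
# counted, because the total number of documents changes.
idf_train = fit_on_train.idf_[fit_on_train.vocabulary_["cat"]]
idf_all = fit_on_all.idf_[fit_on_all.vocabulary_["cat"]]
print(idf_train, idf_all)
```

When the vectorizer is fitted on the training texts only, words that appear solely in the test texts are simply ignored by transform(), which mirrors what would happen with genuinely new documents.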

For more details, you can refer to the article fit() vs transform() vs fit_transform()
