What is the difference between TfidfVectorizer.fit_transform and tfidf.transform?


Question

In Tfidf.fit_transform we pass only X and do not use y to fit the data set. Is this right? We generate the tf-idf matrix only from the features of the training set, and ytrain is not used when fitting. How, then, do we make predictions for the test data set?

Recommended answer

https://datascience.stackexchange.com/a/12346/122 has a good explanation of why the methods are called fit(), transform() and fit_transform().

In summary:

  • fit(): Fit the vectorizer/model to the training data and keep the fitted vectorizer/model in a variable (returns the fitted sklearn.feature_extraction.text.TfidfVectorizer).

  • transform(): Use the fitted vectorizer returned by fit() to transform validation/test data (returns scipy.sparse.csr.csr_matrix).

  • fit_transform(): Sometimes you want to transform the training data directly, so you use fit() and transform() together, hence fit_transform() (returns scipy.sparse.csr.csr_matrix).
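The relationship between the three methods can be checked directly. The following is a minimal sketch, assuming default TfidfVectorizer settings and a small made-up list called train_docs (not part of the original answer); it compares the one-step and two-step routes on the same training text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents, used only for this illustration.
train_docs = [
    "the quick brown fox jumps over the lazy brown dog",
    "mr brown jumps over the lazy fox",
    "roses are red the chocolates are brown",
]

# Route A: fit and transform the training data in a single call.
one_step = TfidfVectorizer().fit_transform(train_docs)

# Route B: fit() returns the fitted vectorizer, which is then used to transform.
fitted = TfidfVectorizer().fit(train_docs)
two_step = fitted.transform(train_docs)

# Both routes should produce the same document-term matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

Route B is the one to prefer when the fitted vectorizer has to be reused later, for example on validation or test data.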

For example:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix


# The *TfidfVectorizer* from sklearn expects a list of strings as input.
sent0 = "The quick brown fox jumps over the lazy brown dog .".lower()
sent1 = "Mr brown jumps over the lazy fox .".lower()
sent2 = "Roses are red , the chocolates are brown .".lower()
sent3 = "The frank dog jumps through the red roses .".lower()

dataset = [sent0, sent1, sent2, sent3]

# Initialize the parameters of the vectorizer.
# `input` describes the kind of input ('content' means raw strings); it is not the data itself.
# `min_df` as an integer is a document count, so 1 keeps every term.
vectorizer = TfidfVectorizer(input='content', analyzer='word', ngram_range=(1, 1),
                             min_df=1, stop_words=None)

[out]:

# Learns the vocabulary of vectorizer based on the initialized parameter.
>>> vectorizer =  vectorizer.fit(dataset)

# Apply the vectorizer to new sentence.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."])
<1x15 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

# Output to array form.
>>> vectorizer.transform(["The brown roses jumps through the chocholate dog ."]).toarray()
array([[0.        , 0.31342551, 0.        , 0.38714286, 0.        ,
        0.        , 0.31342551, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.38714286, 0.51249178, 0.49104163]])

# When you don't need to save the vectorizer for re-using.
>>> vectorizer.fit_transform(dataset)
<4x15 sparse matrix of type '<class 'numpy.float64'>'
    with 28 stored elements in Compressed Sparse Row format>

>>> vectorizer.fit_transform(dataset).toarray()
array([[0.        , 0.49642852, 0.        , 0.30659399, 0.30659399,
        0.        , 0.24821426, 0.30659399, 0.        , 0.30659399,
        0.38887561, 0.        , 0.        , 0.40586285, 0.        ],
       [0.        , 0.32107915, 0.        , 0.        , 0.39659663,
        0.        , 0.32107915, 0.39659663, 0.50303254, 0.39659663,
        0.        , 0.        , 0.        , 0.26250325, 0.        ],
       [0.76012588, 0.24258925, 0.38006294, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.29964599, 0.29964599, 0.19833261, 0.        ],
       [0.        , 0.        , 0.        , 0.34049544, 0.        ,
        0.4318753 , 0.27566041, 0.        , 0.        , 0.        ,
        0.        , 0.34049544, 0.34049544, 0.45074089, 0.4318753 ]])


>>> type(vectorizer)
<class 'sklearn.feature_extraction.text.TfidfVectorizer'>

>>> type(vectorizer.fit_transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>

>>> type(vectorizer.transform(dataset))
<class 'scipy.sparse.csr.csr_matrix'>
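
To come back to the original question: the vectorizer is an unsupervised step, so it does not need y (its fit() accepts a y argument only for API consistency and ignores it). The labels come in when a classifier is fitted on the tf-idf matrix. The sketch below is one possible way to wire this up; the classifier choice (LogisticRegression), the toy texts, and the labels are all assumptions made for illustration and are not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data; texts and labels are made up for this illustration.
X_train = [
    "the quick brown fox jumps over the lazy dog",
    "mr brown jumps over the lazy fox",
    "roses are red and chocolates are brown",
    "the frank dog jumps through the red roses",
]
y_train = ["animal", "animal", "flower", "flower"]
X_test = ["the brown roses are red"]

vectorizer = TfidfVectorizer()

# Unsupervised step: learn the vocabulary and IDF weights from the training text only.
X_train_tfidf = vectorizer.fit_transform(X_train)

# Supervised step: this is where y_train is actually used.
clf = LogisticRegression()
clf.fit(X_train_tfidf, y_train)

# Reuse the already fitted vectorizer on the test text; do not call fit again.
X_test_tfidf = vectorizer.transform(X_test)
predictions = clf.predict(X_test_tfidf)

# Train and test matrices share the same columns (the training vocabulary).
print(X_train_tfidf.shape[1] == X_test_tfidf.shape[1])
print(predictions)
```

Calling transform() rather than fit_transform() on the test set keeps the columns aligned with what the classifier saw during training, and words that never appeared in the training text are simply ignored.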
