Adding New Text to Sklearn TFIDF Vectorizer (Python)

Question

Is there a function to add to an existing corpus? I've already generated my matrix, and I'd like to add to it periodically without re-crunching the whole shebang.

For example:

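from sklearn.feature_extraction.text import TfidfVectorizer

# prep_text and tokenize_text are the asker's own functions (not shown here)
articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']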
tfidf_vectorizer = TfidfVectorizer(
                        max_df=.8,
                        max_features=2000,
                        min_df=.05,
                        preprocessor=prep_text,
                        use_idf=True,
                        tokenizer=tokenize_text
                    )
tfidf_matrix = tfidf_vectorizer.fit_transform(articleList)

#### ADDING A NEW ARTICLE TO EXISTING SET?
bigger_tfidf_matrix = tfidf_vectorizer.fit_transform(['the last article I wanted to add'])
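
Note that a second call to fit_transform does not extend the fitted model; it refits the vectorizer on the new input alone. The following is a minimal sketch of that behaviour. It assumes the default tokenizer settings instead of the custom prep_text and tokenize_text functions, which are not shown in the question, and it drops the max_df / min_df / max_features limits for simplicity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

article_list = [
    'here is some text blah blah',
    'another text object',
    'more foo for your bar right now',
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(article_list)
vocab_before = set(vectorizer.vocabulary_)

# A second fit_transform relearns the vocabulary and IDF weights
# from the new input only; the earlier fit is discarded.
new_matrix = vectorizer.fit_transform(['the last article I wanted to add'])
vocab_after = set(vectorizer.vocabulary_)

print(tfidf_matrix.shape)  # (3, 14)
print(new_matrix.shape)    # (1, 6)
print('text' in vocab_before, 'text' in vocab_after)  # True False
```

The two matrices have different column counts and different column meanings, so they cannot simply be stacked.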

Recommended answer

You can access the vocabulary_ attribute of your vectoriser directly, and you can access the idf_ vector via _tfidf._idf_diag, so it would be possible to monkey-patch something like this:

import re 
import numpy as np
from scipy.sparse import dia_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

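def partial_fit(self, X):
    """Fold new documents into an already fitted TfidfVectorizer."""
    max_idx = max(self.vocabulary_.values())
    for a in X:
        # update vocabulary_ with any tokens not seen before
        if self.lowercase:
            a = a.lower()
        tokens = re.findall(self.token_pattern, a)
        for w in tokens:
            if w not in self.vocabulary_:
                max_idx += 1
                self.vocabulary_[w] = max_idx

        # update idf_: recover document frequencies from the current idf_ values
        df = (self.n_docs + self.smooth_idf)/np.exp(self.idf_ - 1) - self.smooth_idf
        self.n_docs += 1
        # pad with zeros for the newly added vocabulary entries
        df = np.concatenate([df, np.zeros(len(self.vocabulary_) - len(df))])
        for w in set(tokens):  # count each term once per document
            df[self.vocabulary_[w]] += 1
        idf = np.log((self.n_docs + self.smooth_idf)/(df + self.smooth_idf)) + 1
        # _idf_diag is a private attribute and may differ between scikit-learn versions
        self._tfidf._idf_diag = dia_matrix((idf, 0), shape=(len(idf), len(idf)))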

TfidfVectorizer.partial_fit = partial_fit
articleList = ['here is some text blah blah','another text object', 'more foo for your bar right now']
vec = TfidfVectorizer()
vec.fit(articleList)
vec.n_docs = len(articleList)
vec.partial_fit(['the last text I wanted to add'])
vec.transform(['the last text I wanted to add']).toarray()

# array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
#          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
#          0.        ,  0.        ,  0.27448674,  0.        ,  0.43003652,
#          0.43003652,  0.43003652,  0.43003652,  0.43003652]])
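
One way to sanity-check an incremental update like this is to compare it with a vectorizer refitted on the full corpus. The sketch below does that using only public attributes (vocabulary_ and idf_). Instead of inverting the stored IDF values, it keeps the raw document-frequency counts in a small helper class; IncrementalIdf and tokenize are hypothetical names introduced here for illustration, not part of scikit-learn. It assumes the default TfidfVectorizer settings (lowercasing, default token_pattern, smooth_idf=True), for which the smoothed weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # TfidfVectorizer's default token_pattern


def tokenize(doc):
    """Lowercase and tokenize the way a default TfidfVectorizer does."""
    return re.findall(TOKEN_PATTERN, doc.lower())


class IncrementalIdf:
    """Hypothetical helper: tracks document frequencies so IDF can be updated cheaply."""

    def __init__(self):
        self.n_docs = 0
        self.df = {}  # term -> number of documents containing the term

    def partial_fit(self, docs):
        for doc in docs:
            self.n_docs += 1
            for term in set(tokenize(doc)):  # count each term once per document
                self.df[term] = self.df.get(term, 0) + 1
        return self

    def idf(self, term):
        # Smoothed IDF, matching smooth_idf=True: ln((1 + n) / (1 + df)) + 1
        return float(np.log((1 + self.n_docs) / (1 + self.df[term])) + 1)


old_docs = [
    'here is some text blah blah',
    'another text object',
    'more foo for your bar right now',
]
new_docs = ['the last text I wanted to add']

tracker = IncrementalIdf().partial_fit(old_docs).partial_fit(new_docs)

# Reference: refit scikit-learn on the full corpus and compare term by term.
reference = TfidfVectorizer().fit(old_docs + new_docs)
for term, idx in reference.vocabulary_.items():
    assert np.isclose(tracker.idf(term), reference.idf_[idx])

print(round(tracker.idf('text'), 4))  # 1.2231
print(round(tracker.idf('last'), 4))  # 1.9163
```

Storing the counts avoids the exp/log round trip in the monkey-patch, at the cost of keeping one extra dictionary alongside the vectorizer. Note that previously computed rows of the TF-IDF matrix still reflect the old IDF weights and would need to be re-transformed to stay consistent.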
