Store Tf-idf matrix and update existing matrix on new articles in pandas

Question

I have a pandas dataframe whose column text consists of news articles, for example:

text
article1
article2
article3
article4

I have calculated the tf-idf values for the articles as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
matrix_1 = tfidf.fit_transform(df['text'])

My dataframe is updated from time to time. So, let's say that after calculating tf-idf as matrix_1, my dataframe got updated with more articles. Something like:

text
article1
article2
article3
article4
article5
article6
article7

I have millions of articles, so I want to store the tf-idf matrix of the previous articles and update it with the tf-idf scores of the new articles. Running the tf-idf code over all articles again and again would be memory consuming. Is there any way I can do this?

Recommended answer

I haven't tested this code, but I think it should work.

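import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'text': []})  # gets populated with articles over time
tfidf = None
matrix_1 = None
last_len = 0  # number of rows already converted to tf-idf
while True:
    if len(df) > last_len:
        new_text = df['text'].iloc[last_len:]
        if matrix_1 is None:
            # When your dataframe is populated for the very first time:
            # learn the vocabulary and idf weights from the first batch
            tfidf = TfidfVectorizer()
            matrix_1 = tfidf.fit_transform(new_text)
        else:
            # When your dataframe is populated again and again:
            # reuse the earlier fitted model and append only the new rows.
            # The matrices are sparse, so use scipy.sparse.vstack, not np.vstack
            matrix_1 = vstack([matrix_1, tfidf.transform(new_text)])
            # Calling tfidf.fit_transform(new_text) here instead doesn't make
            # sense: it learns a new vocabulary, so the columns would generally
            # no longer line up with those of matrix_1
        last_len = len(df)

    # TO-DO Some break condition according to your case
    #####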

If the duration between dataframe updates is long, you can use pickle on matrix_1 to store intermediate results.
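A minimal sketch of the pickle idea, assuming it is useful to store the fitted vectorizer alongside the matrix so that a later batch can be transformed into the same columns. The sample sentences and the file name tfidf_state.pkl are made up for illustration; scipy.sparse.vstack is used because the tf-idf matrix is sparse.

```python
import os
import pickle
import tempfile

from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the first batch of articles and a later batch
first_batch = ["the cat sat on the mat", "the dog chased the cat"]
later_batch = ["a dog sat on the mat"]

# First run: fit on the initial batch, then save vectorizer and matrix together
tfidf = TfidfVectorizer()
matrix_1 = tfidf.fit_transform(first_batch)
state_path = os.path.join(tempfile.mkdtemp(), "tfidf_state.pkl")  # hypothetical file name
with open(state_path, "wb") as f:
    pickle.dump({"tfidf": tfidf, "matrix": matrix_1}, f)

# Later run: load the saved state, transform only the new articles, append, save again
with open(state_path, "rb") as f:
    state = pickle.load(f)
new_rows = state["tfidf"].transform(later_batch)
state["matrix"] = vstack([state["matrix"], new_rows]).tocsr()
with open(state_path, "wb") as f:
    pickle.dump(state, f)

restored_matrix = state["matrix"]
print(restored_matrix.shape)  # one row per article seen so far
```

Only the new articles are vectorized on the later run; the rows for the earlier articles are read back from disk unchanged.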

However, I feel that calling tfidf.fit_transform(df['text']) again and again on different inputs will not give you meaningful results, because each call learns a new vocabulary and new idf weights. Or maybe I misunderstood. Cheers!
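The concern about refitting can be illustrated with a small sketch, assuming two made-up batches of text: transform() keeps the columns learned from the first batch, while a second fit_transform() replaces them with a different vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical batches, invented for this illustration
batch_a = ["stocks fell sharply today", "markets rallied after the news"]
batch_b = ["the team won the final match"]

tfidf = TfidfVectorizer()
matrix_a = tfidf.fit_transform(batch_a)
vocab_a = dict(tfidf.vocabulary_)

# transform() keeps the vocabulary learned from batch_a, so columns line up
rows_b_same_space = tfidf.transform(batch_b)

# fit_transform() on batch_b discards that vocabulary and learns a new one
matrix_b_refit = tfidf.fit_transform(batch_b)
vocab_b = dict(tfidf.vocabulary_)

print(matrix_a.shape[1], rows_b_same_space.shape[1], matrix_b_refit.shape[1])
print(set(vocab_a) == set(vocab_b))
```

Note the trade-off with transform(): words that appear only in the new articles are ignored, and the idf weights stay as learned from the first batch, so an occasional full refit may still be needed if the vocabulary drifts.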
