How to get tf-idf with a pandas DataFrame?

Question

I want to calculate tf-idf for the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

First, I thought I would need to get a word count for each row, so I wrote a simple function:

def word_count(sent):
    # Map each whitespace-separated word to its number of occurrences.
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt:
            word2cnt[word] += 1
        else:
            word2cnt[word] = 1
    return word2cnt

Then I applied it to each row:

df['word_count'] = df['sent'].apply(word_count)

But now I'm lost. I know there's an easy way to calculate tf-idf with Graphlab, but I want to stick with an open source option. Both Sklearn and gensim look overwhelming. What's the simplest way to get tf-idf?

Recommended answer

The scikit-learn implementation is really easy:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify; see the scikit-learn documentation for TfidfVectorizer.
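
As a rough sketch of what passing parameters looks like, the snippet below sets ngram_range and sublinear_tf on the vectorizer. The values are illustrative assumptions, not recommendations, and the names v_bigram and x_bigram are introduced here only so the v and x from the answer stay untouched.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy documents as in the question.
df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

# ngram_range=(1, 2) adds two-word phrases next to single words;
# sublinear_tf=True replaces the raw term count tf with 1 + log(tf).
v_bigram = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
x_bigram = v_bigram.fit_transform(df['sent'])

# One row per document, one column per unigram or bigram.
print(x_bigram.shape)
```

With these three sentences the vocabulary grows from 7 single words to 15 terms once bigrams are included.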

The output of fit_transform is a sparse matrix. To inspect it as a dense array, call x.toarray():

In [44]: x.toarray()
Out[44]: 
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])
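
Since the question starts from a pandas DataFrame, it can help to put the scores back into one with the words as column labels. The following is a minimal sketch, assuming scikit-learn 1.0 or later for get_feature_names_out (older releases expose get_feature_names instead); the name tfidf_df is made up for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

# Column labels: the learned vocabulary, in column order of the matrix.
# Fall back to the older method name on scikit-learn < 1.0.
if hasattr(v, 'get_feature_names_out'):
    names = list(v.get_feature_names_out())
else:
    names = list(v.get_feature_names())

# Dense DataFrame indexed by docId; fine for small corpora, but a large
# corpus should stay sparse to avoid exhausting memory.
tfidf_df = pd.DataFrame(x.toarray(), columns=names, index=df['docId'])
print(tfidf_df.round(3))
```

Each row is a document and each column a lowercased word, so tfidf_df.loc[1, 'first'] is the score of "first" in document 1.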
