Append tfidf to pandas dataframe

Problem description

I have the following pandas structure:

col1 col2 col3 text
1    1    0    meaningful text
5    9    7    trees
7    8    2    text

I'd like to vectorise the text column using a tfidf vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via mysparsematrix.toarray(). How can I add this information, with labels, to my original df? The target would look like:

col1 col2 col3 meaningful text trees
1    1    0    1          1    0
5    9    7    0          0    1
7    8    2    0          1    0

Update:

The solution makes the concatenation wrong even when I rename the original columns: dropping columns with at least one NaN leaves only 7 rows, even though I call fillna(0) before I start working with the data.

Recommended answer

You can proceed as follows:

Load the data into a dataframe:

import pandas as pd

df = pd.read_table("/tmp/test.csv", sep=r"\s+")
print(df)

Output:

   col1  col2  col3             text
0     1     1     0  meaningful text
1     5     9     7            trees
2     7     8     2             text
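If you do not have the file at hand, the same frame can be built in memory. This is only a sketch for reproducing the example: the values are copied from the question, and nothing here depends on /tmp/test.csv.

```python
import pandas as pd

# Same sample rows as in the question, built in memory instead of read from a file.
df = pd.DataFrame(
    {
        "col1": [1, 5, 7],
        "col2": [1, 9, 8],
        "col3": [0, 7, 2],
        "text": ["meaningful text", "trees", "text"],
    }
)
print(df)
```

Building the frame this way also sidesteps any ambiguity about how whitespace inside the text column is parsed when the file is read.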

Vectorize the text column using sklearn.feature_extraction.text.TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(df['text'])

Convert the vectorized data into a dataframe:

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(df1)

Output:

   meaningful      text  trees
0    0.795961  0.605349    0.0
1    0.000000  0.000000    1.0
2    0.000000  1.000000    0.0
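To see where these weights come from, here is a rough hand computation in plain Python. It assumes the TfidfVectorizer defaults (smooth_idf=True, norm="l2", raw term counts) and uses a simple whitespace split in place of scikit-learn's tokenizer, which gives the same tokens for this sample only.

```python
import math

docs = ["meaningful text", "trees", "text"]
tokens = [d.split() for d in docs]  # simplification: whitespace tokenizer
vocab = sorted({t for doc in tokens for t in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocab:
    doc_freq = sum(term in doc for doc in tokens)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Term count times idf, then L2-normalise each row.
rows = []
for doc in tokens:
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    rows.append([w / norm for w in raw])

print(vocab)
for row in rows:
    print([round(w, 6) for w in row])
```

The first row should agree with the 0.795961 and 0.605349 shown above: "meaningful" occurs in one document and "text" in two, so "meaningful" gets the larger idf before normalisation.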

Concatenate the vectorized dataframe to the original dataframe:

res = pd.concat([df, df1], axis=1)
print(res)

Output:

   col1  col2  col3             text  meaningful      text  trees
0     1     1     0  meaningful text    0.795961  0.605349    0.0
1     5     9     7            trees    0.000000  0.000000    1.0
2     7     8     2             text    0.000000  1.000000    0.0

If you want to drop the column text, you need to do that before the concatenation:

df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
print(res)

Output:

   col1  col2  col3  meaningful      text  trees
0     1     1     0    0.795961  0.605349    0.0
1     5     9     7    0.000000  0.000000    1.0
2     7     8     2    0.000000  1.000000    0.0

Complete code:

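import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the whitespace-separated file into a dataframe
df = pd.read_table("/tmp/test.csv", sep=r"\s+")

# Fit the vectorizer on the text column; x is a sparse matrix
v = TfidfVectorizer()
x = v.fit_transform(df['text'])

# One column per vocabulary term (get_feature_names() was removed in scikit-learn 1.2)
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())

# Drop the raw text column, then join the two frames side by side
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)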
