Convert CountVectorizer and TfidfTransformer Sparse Matrices into Separate Pandas Dataframe Rows


Problem description

Question: What is the best way to convert the sparse matrices produced by sklearn's CountVectorizer and TfidfTransformer into Pandas DataFrame columns, with a separate row for each bigram and its corresponding frequency and tf-idf score?

Pipeline: Bring in text data from a SQL DB, split the text into bigrams, calculate the frequency and the tf-idf per bigram per document, and load the results back into the SQL DB.
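The read and write ends of this pipeline are not shown in the question. As a minimal sketch, assuming a DB-API connection (an in-memory SQLite database stands in for the real server here) and hypothetical table names documents and bigram_scores, the round trip could look like this with pandas:

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite DB replaces the real SQL server.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE documents (number INTEGER, text TEXT)")  # hypothetical table
con.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        (123, "The farmer plants grain"),
        (234, "The farmer and his son go fishing"),
        (345, "The fisher catches tuna"),
    ],
)
con.commit()

# Read the source text into a DataFrame.
data = pd.read_sql_query("SELECT number, text FROM documents", con)

# Placeholder for the long-format result produced later in the pipeline.
results = pd.DataFrame(
    {"number": [123], "bigram": ["farmer plants"], "frequency": [1], "tfidf_score": [0.707107]}
)

# Write the results back; "bigram_scores" is a hypothetical table name.
results.to_sql("bigram_scores", con, index=False, if_exists="replace")
written = pd.read_sql_query("SELECT * FROM bigram_scores", con)
```

For a server database, the same two pandas calls are normally given a SQLAlchemy engine instead of the sqlite3 connection.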

Current state:

Two columns of data are brought in (number, text). text is cleaned to produce a third column, cleanText:

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna
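The cleaning step itself is not shown. To reproduce the example, a sketch like the following rebuilds the frame; the clean() helper and its three-word stop list are assumptions chosen only so that the result matches the table above, not the asker's actual cleaning code:

```python
import pandas as pd

# Assumption: a tiny stop list that happens to reproduce the cleanText column.
STOP_WORDS = {"the", "and", "his"}


def clean(text):
    """Lowercase, split on whitespace, and drop the assumed stop words."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


data = pd.DataFrame(
    {
        "number": [123, 234, 345],
        "text": [
            "The farmer plants grain",
            "The farmer and his son go fishing",
            "The fisher catches tuna",
        ],
    }
)
data["cleanText"] = data["text"].apply(clean)
print(data)
```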

This DataFrame is fed into sklearn's feature extraction:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

The matrices are then converted to arrays and fed back into the original DataFrame:

data['frequency'] = list(dt_mat.toarray())
data['tfidf_score'] = list(tfidf_mat.toarray())

Output:

   number                               text              cleanText  \
0     123            The farmer plants grain    farmer plants grain   
1     234  The farmer and his son go fishing  farmer son go fishing   
2     345            The fisher catches tuna    fisher catches tuna   

               frequency                                        tfidf_score  

0  [0, 1, 0, 0, 0, 1, 0]  [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...  
1  [0, 0, 1, 0, 1, 0, 1]  [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...  
2  [1, 0, 0, 1, 0, 0, 0]  [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0... 
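The scores in this output follow from TfidfTransformer's defaults (smoothed idf and L2 row normalization). In this sample every bigram occurs once, in exactly one document, so all bigrams get the same idf weight and each row normalizes to 1/sqrt(k) for a document with k bigrams. A quick check of that arithmetic, written by hand rather than taken from the original post:

```python
import math

n_docs = 3  # documents in the sample
df = 1      # each bigram appears in exactly one document

# Smoothed idf as used when smooth_idf=True: ln((1 + n) / (1 + df)) + 1
idf = math.log((1 + n_docs) / (1 + df)) + 1


def row_scores(k):
    """tf-idf scores for a document with k bigrams, each with count 1."""
    raw = [1 * idf] * k
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]


print(row_scores(2))  # documents 123 and 345 (two bigrams each)
print(row_scores(3))  # document 234 (three bigrams)
```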

Problems:

  1. The feature names (i.e. the bigrams) are not in the DataFrame
  2. The frequency and tfidf_score are not on separate rows for each bigram

Desired output:

       number                    bigram         frequency      tfidf_score
0     123            farmer plants                 1              0.70  
0     123            plants grain                  1              0.56
1     234            farmer son                    1              0.72
1     234            son go                        1              0.63
1     234            go fishing                    1              0.34
2     345            fisher catches                1              0.43
2     345            catches tuna                  1              0.43

I managed to get one of the numeric columns assigned to separate rows of the DataFrame with this code:

data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn]) 
                         for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])

Output:

    number  tfidf_score
0      123     0.000000
1      123     0.707107
2      123     0.000000
3      123     0.000000
4      123     0.000000
5      123     0.707107
6      123     0.000000
7      234     0.000000
8      234     0.000000
9      234     0.577350
10     234     0.000000
11     234     0.577350
12     234     0.000000
13     234     0.577350
14     345     0.707107
15     345     0.000000
16     345     0.000000
17     345     0.707107
18     345     0.000000
19     345     0.000000
20     345     0.000000

However, I am unsure how to do this for both numeric columns, and this doesn't bring in the bigrams (feature names) themselves. Also, this method requires an array (which is why I converted the sparse matrices to arrays in the first place). I would like to avoid that if possible, both for performance reasons and because I would then have to strip out the meaningless zero rows.

Any insight is greatly appreciated! Thank you very much for taking the time to read this question - I apologize for the length. Please let me know if there's anything I can do to improve the question or clarify my process.

Recommended answer

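The bigram names can be retrieved from CountVectorizer with get_feature_names(). (In scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2.)

The CountVectorizer feature names are, in this case, the bigrams: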

print(cv.get_feature_names())

[u'catches tuna',
 u'farmer plants',
 u'farmer son',
 u'fisher catches',
 u'go fishing',
 u'plants grain',
 u'son go']

CountVectorizer.fit_transform() returns a sparse matrix. We can convert it to a dense representation, wrap it in a DataFrame, and then tack on the feature names as columns:

bigrams = pd.DataFrame(dt_mat.todense(), index=data.index, columns=cv.get_feature_names())
bigrams['number'] = data.number
print(bigrams)

   catches tuna  farmer plants  farmer son  fisher catches  go fishing  \
0             0              1           0               0           0   
1             0              0           1               0           1   
2             1              0           0               1           0   

   plants grain  son go  number  
0             1       0     123  
1             0       1     234  
2             0       0     345  

To go from wide to long format, use melt().
Then restrict the results to bigram matches (query() is useful here):

bigrams_long = (pd.melt(bigrams.reset_index(), 
                       id_vars=['index','number'],
                       value_name='bigram_ct')
                 .query('bigram_ct > 0')
                 .sort_values(['index','number']))

    index  number        variable  bigram_ct
3       0     123   farmer plants          1
15      0     123    plants grain          1
7       1     234      farmer son          1
13      1     234      go fishing          1
19      1     234          son go          1
2       2     345    catches tuna          1
11      2     345  fisher catches          1

Now repeat the process for tfidf:

tfidf = pd.DataFrame(tfidf_mat.todense(), index=data.index, columns=cv.get_feature_names())
tfidf['number'] = data.number

tfidf_long = pd.melt(tfidf.reset_index(), 
                     id_vars=['index','number'], 
                     value_name='tfidf').query('tfidf > 0')

Finally, merge bigrams and tfidf:

fulldf = (bigrams_long.merge(tfidf_long, 
                             on=['index','number','variable'])
                      .set_index('index'))

       number        variable  bigram_ct     tfidf
index                                             
0         123   farmer plants          1  0.707107
0         123    plants grain          1  0.707107
1         234      farmer son          1  0.577350
1         234      go fishing          1  0.577350
1         234          son go          1  0.577350
2         345    catches tuna          1  0.707107
2         345  fisher catches          1  0.707107
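The answer above densifies both matrices, which the question hoped to avoid. As an alternative sketch (not part of the original answer), the long format can be built from the sparse matrices directly: the COO representation exposes the row index, column index, and value of each non-zero entry, so zero cells are never materialized. The column names follow the desired output in the question; the try/except only covers the scikit-learn method rename.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = pd.DataFrame(
    {
        "number": [123, 234, 345],
        "cleanText": ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"],
    }
)

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
dt_mat = cv.fit_transform(data["cleanText"])
tfidf_mat = TfidfTransformer().fit_transform(dt_mat)

# The method was renamed in scikit-learn 1.0; fall back for older versions.
try:
    names = np.asarray(cv.get_feature_names_out())
except AttributeError:
    names = np.asarray(cv.get_feature_names())

# COO format lists (row, col, value) for every non-zero count.
coo = dt_mat.tocoo()

# Look up the tf-idf value at the same (row, col) positions.
tfidf_vals = np.asarray(tfidf_mat.tocsr()[coo.row, coo.col]).ravel()

long_df = (
    pd.DataFrame(
        {
            "number": data["number"].to_numpy()[coo.row],
            "bigram": names[coo.col],
            "frequency": coo.data,
            "tfidf_score": tfidf_vals,
        }
    )
    .sort_values(["number", "bigram"])
    .reset_index(drop=True)
)
print(long_df)
```

Memory use here scales with the number of non-zero entries rather than documents times vocabulary size, which matters once the vocabulary of bigrams grows large.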
