Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas


Question

I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search engine/ranking machine learning algorithm.

I'm doing this in an IPython notebook and keep running into MemoryErrors. After a few hours of digging I'm still not sure why.

My setup:

  • Lenovo E560 laptop
  • Core i7-6500U @ 2.50 GHz
  • 16 GB Ram
  • Windows 10
  • Anaconda 3.5 kernel, with all libraries updated

I've tested my code/goal on a small toy dataset, following a similar Stack Overflow question:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial

clf = TfidfVectorizer()

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

df = pd.DataFrame(data={'a':a, 'b':b})

clf.fit(df['a'] + " " + df['b'])

tfidf_a = clf.transform(df['a']).todense()
tfidf_b = clf.transform(df['b']).todense()

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]

df['tfidf_cosine_similarity'] = row_similarities

print(df)

This gives the following (good!) output:

                   a                 b  tfidf_cosine_similarity
0         hello world        my name is                 0.000000
1          my name is       hello world                 0.000000
2  what is your name?  my name is what?                 0.725628
3      max cosine sim    max cosine sim                 1.000000

However, when I apply the same method to a dataframe (df_all_export) with dimensions 186,154 x 5, where 2 of the 5 columns are the query (search_term) and the document (product_title), like so:

clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])

tfidf_a = clf.transform(df_all_export['search_term']).todense()
tfidf_b = clf.transform(df_all_export['product_title']).todense()

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]
df_all_export['tfidf_cosine_similarity'] = row_similarities

df_all_export.head()

I get the following (I haven't included the whole error here, but you get the idea):

MemoryError                               Traceback (most recent call last)
<ipython-input-27-8308fcfa8f9f> in <module>()
     12 clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
     13 
---> 14 tfidf_a = clf.transform(df_all_export['search_term']).todense()
     15 tfidf_b = clf.transform(df_all_export['product_title']).todense()
     16

Absolutely lost on this one, but I fear the solution will be quite simple and elegant :)

Thanks in advance!

Recommended answer

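You can still use the sparse matrices directly (no .todense() call), with paired_cosine_distances and cosine_similarity from sklearn.metrics.pairwise.

paired_cosine_distances shows how far apart (how different) your strings are, comparing the values in the two columns row by row. A distance of 0 means a full match: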

In [136]: paired_cosine_distances(A, B)
Out[136]: array([ 1.        ,  1.        ,  0.27437247,  0.        ])
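
The session above does not show how A and B were built. A minimal sketch, assuming they are the sparse outputs of clf.transform on the toy columns from the question (the names A and B follow the session; everything else is an assumption about the setup):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Toy data copied from the question.
a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a': a, 'b': b})

clf = TfidfVectorizer()
clf.fit(df['a'] + " " + df['b'])

# Assumption: A and B are the transform outputs, left sparse (no .todense()).
A = clf.transform(df['a'])
B = clf.transform(df['b'])

# Row-wise cosine distance between A[i] and B[i]; similarity is 1 - distance.
df['tfidf_cosine_similarity'] = 1 - paired_cosine_distances(A, B)

print(df)
```

The same pattern should carry over to df_all_export by swapping in the search_term and product_title columns, since nothing here allocates a dense rows-by-vocabulary array.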

cosine_similarity compares the first string of column a with all strings in column b (row 1), the second string of column a with all strings in column b (row 2), and so on:

In [137]: cosine_similarity(A, B)
Out[137]:
array([[ 0.        ,  1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.74162106,  0.        ],
       [ 0.43929881,  0.        ,  0.72562753,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

In [141]: A
Out[141]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>

In [142]: B
Out[142]:
<4x10 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>
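
The two functions are related: the diagonal of the cosine_similarity matrix holds the row-by-row similarities, which is 1 minus what paired_cosine_distances returns. The sketch below checks this on the toy data (the variable names full and row_wise are illustrative, not from the original answer):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances

a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']

# Fit on the concatenated rows, as in the question.
clf = TfidfVectorizer().fit([x + " " + y for x, y in zip(a, b)])
A = clf.transform(a)  # sparse
B = clf.transform(b)  # sparse

# Every row of A against every row of B: a dense (4, 4) array.
full = cosine_similarity(A, B)

# Only A[i] against B[i]: a 1-D array of length 4.
row_wise = 1 - paired_cosine_distances(A, B)

# The diagonal of the full matrix should agree with the row-wise values.
print(np.allclose(np.diag(full), row_wise))
```

For the row-by-row feature in the question, paired_cosine_distances is the better fit: with 186,154 rows, the full cosine_similarity matrix would have 186,154 x 186,154 entries, roughly 277 GB as float64.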

NOTE: all calculations were done using sparse matrices. We never converted them to dense form in memory!
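
To see why this matters, here is a rough size estimate of dense versus sparse storage. The helper names dense_nbytes and sparse_nbytes are made up for illustration, and the 20,000-term vocabulary is a purely hypothetical figure, since the question does not report the real vocabulary size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
A = TfidfVectorizer().fit_transform(docs)  # CSR sparse matrix


def dense_nbytes(shape, itemsize=8):
    """Bytes needed for a dense float64 array of the given shape."""
    rows, cols = shape
    return rows * cols * itemsize


def sparse_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


print("toy dense bytes: ", dense_nbytes(A.shape))
print("toy sparse bytes:", sparse_nbytes(A))

# Hypothetical: 186,154 rows with an ASSUMED vocabulary of 20,000 terms.
print("assumed dense size in GB:", dense_nbytes((186154, 20000)) / 1e9)
```

Under that assumed vocabulary, a single dense matrix would already exceed the 16 GB of RAM listed in the question, while sparse storage grows only with the number of non-zero entries.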
