Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas
Problem description
I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search engine/ranking machine learning algorithm.
I'm doing this in an IPython notebook and am unfortunately running into MemoryErrors. After a few hours of digging, I'm still not sure why.
My setup:
- Lenovo E560 laptop
- Core i7-6500U @ 2.50 GHz
- 16 GB RAM
- Windows 10
- Using the Anaconda 3.5 kernel with all libraries updated
I've tested my code/goal on a small toy dataset, as per a similar Stack Overflow question, like so:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import spatial
clf = TfidfVectorizer()
a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a':a, 'b':b})
clf.fit(df['a'] + " " + df['b'])
tfidf_a = clf.transform(df['a']).todense()
tfidf_b = clf.transform(df['b']).todense()
row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]
df['tfidf_cosine_similarity'] = row_similarities
print(df)
This gives the following (good!) output:
                    a                 b  tfidf_cosine_similarity
0         hello world        my name is                 0.000000
1          my name is       hello world                 0.000000
2  what is your name?  my name is what?                 0.725628
3      max cosine sim    max cosine sim                 1.000000
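For reference, the same row-wise similarities can be obtained on the toy data without calling `.todense()`. This is only a sketch under stated assumptions: it relies on `TfidfVectorizer` using its default `norm='l2'`, so every non-empty row has unit length and the cosine similarity of two rows reduces to their dot product. The helper name `rowwise_tfidf_cosine` is hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def rowwise_tfidf_cosine(left, right):
    """Hypothetical helper: cosine similarity between left[i] and right[i].

    Both arguments are pandas Series of strings with the same index.
    Assumes the default norm='l2', so each tf-idf row is unit length.
    """
    vectorizer = TfidfVectorizer()
    vectorizer.fit(left + " " + right)
    tfidf_left = vectorizer.transform(left)    # scipy sparse matrix, kept sparse
    tfidf_right = vectorizer.transform(right)
    # Element-wise product summed per row is the row-wise dot product.
    dots = tfidf_left.multiply(tfidf_right).sum(axis=1)
    return np.asarray(dots).ravel()


a = ['hello world', 'my name is', 'what is your name?', 'max cosine sim']
b = ['my name is', 'hello world', 'my name is what?', 'max cosine sim']
df = pd.DataFrame(data={'a': a, 'b': b})

sims = rowwise_tfidf_cosine(df['a'], df['b'])
df['tfidf_cosine_similarity'] = sims
print(df)
```

One difference from the `spatial.distance.cosine` version: a row whose text contains no vocabulary terms gets a similarity of 0 here rather than a NaN from dividing by a zero norm.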
However, when I try to apply the same method to a dataframe (df_all_export) with dimensions 186,154 x 5, where 2 of the 5 columns are the query (search_term) and the document (product_title), like so:
clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
tfidf_a = clf.transform(df_all_export['search_term']).todense()
tfidf_b = clf.transform(df_all_export['product_title']).todense()
row_similarities = [1 - spatial.distance.cosine(tfidf_a[x],tfidf_b[x]) for x in range(len(tfidf_a)) ]
df_all_export['tfidf_cosine_similarity'] = row_similarities
df_all_export.head()
I get the following (I haven't included the whole error here, but you get the idea):
MemoryError Traceback (most recent call last)
<ipython-input-27-8308fcfa8f9f> in <module>()
12 clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])
13
---> 14 tfidf_a = clf.transform(df_all_export['search_term']).todense()
15 tfidf_b = clf.transform(df_all_export['product_title']).todense()
16
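A back-of-envelope estimate suggests why the `.todense()` call on line 14 of the traceback is the likely culprit. The sketch below assumes 8 bytes per value (`TfidfVectorizer` defaults to float64) and a vocabulary of 20,000 terms. That vocabulary size is a hypothetical placeholder, not a figure measured from this dataset; the real one is `len(clf.vocabulary_)` after fitting.

```python
def dense_matrix_gib(n_rows, n_terms, bytes_per_value=8):
    """Memory needed to hold an n_rows x n_terms dense matrix, in GiB."""
    return n_rows * n_terms * bytes_per_value / 2**30


n_rows = 186_154          # rows in df_all_export, from the question
assumed_vocab = 20_000    # hypothetical vocabulary size, not measured

estimate = dense_matrix_gib(n_rows, assumed_vocab)
print(f"one dense tf-idf matrix: about {estimate:.1f} GiB")
```

Under these assumptions a single dense matrix would need well over the 16 GB of RAM available, and the code builds two of them.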
Absolutely lost on this one, but I fear the solution will be quite simple and elegant :)
Thanks in advance!