Apache Spark Python Cosine Similarity over DataFrames


Problem Description

For a Recommender System, I need to compute the cosine similarity between all the columns of a whole Spark DataFrame.

In Pandas I used to do this:

import sklearn.metrics as metrics
import pandas as pd

df = pd.DataFrame(...some dataframe over here :D ...)
metrics.pairwise.cosine_similarity(df.T, df.T)

That generates the similarity matrix between the columns (since I used the transpose).

Is there any way to do the same thing in Spark (Python)?

(I need to apply this to a matrix made of tens of millions of rows, and thousands of columns, so that's why I need to do it in Spark)

Answer

You can use the built-in columnSimilarities() method on a RowMatrix, which can either compute the exact cosine similarities or estimate them using the DIMSUM method; the latter is considerably faster for large datasets. The only difference in usage is that for DIMSUM you have to pass a similarity threshold.

Here's a small reproducible example:

from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])

# Convert to RowMatrix
mat = RowMatrix(rows)

# Calculate exact and approximate column similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)

# Output: the upper-triangular entries of the column-similarity matrix
exact.entries.collect()
# [MatrixEntry(0, 2, 0.991935352214),
#  MatrixEntry(1, 2, 0.998441152599),
#  MatrixEntry(0, 1, 0.997463284056)]
