Get Column Names after columnSimilarities() in Spark Scala

Question

I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After calling columnSimilarities(), I want to assign the original column names back to the results in Spark Scala.

Runnable code to compute columnSimilarities() on a DataFrame:

Data

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// rdd
val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row(2.0, 7.0, 1.0),
    Row(3.5, 2.5, 0.0),
    Row(7.0, 5.9, 0.0)
  )
)

// Schema
val schema = new StructType()
  .add(StructField("item_1", DoubleType, true))
  .add(StructField("item_2", DoubleType, true))
  .add(StructField("item_3", DoubleType, true))

// Data frame
val df = spark.createDataFrame(rowsRdd, schema)

Compute columnSimilarities() on this DataFrame:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}

val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd

val items_mllib_vector = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
                             .map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(items_mllib_vector)
val simsPerfect = mat.columnSimilarities()


simsPerfect.entries.collect.mkString(", ")

Output:

res0: String = MatrixEntry(0,2,0.24759378423606918), MatrixEntry(1,2,0.7376189553526812), MatrixEntry(0,1,0.8355316482961213)

I need the original column names instead of the positions in that vector.

I tried to read the column names from df with:

val names = df.columns

My idea was to match the names with the positions in that vector, which should be in the same order, but I don't know how to attach the names back to the cosine similarity results.

I'd appreciate any suggestions!

Recommended answer

Extract the column names first (this is the tricky part, because df.columns cannot be evaluated inside the closure):

val names = df.columns

Map the entries:

simsPerfect.entries.map {
  case MatrixEntry(i, j, v)  => (names(i.toInt),  names(j.toInt), v)
}
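
For reference, the two steps above can be combined into one end-to-end sketch that turns the named entries into a DataFrame. This is only an illustration under some assumptions: the output column names item_a, item_b and similarity are made up for this example, the session is started with a local master when none is active, and the matrix is built directly from MLlib vectors holding the same values as the question's DataFrame.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.sql.SparkSession

// Assumption: reuse the active session if there is one, otherwise start a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("named-similarities")
  .getOrCreate()

// Same values as the question's data; names are in the same order as the vector positions.
val names = Array("item_1", "item_2", "item_3")
val mat = new RowMatrix(spark.sparkContext.parallelize(Seq(
  Vectors.dense(2.0, 7.0, 1.0),
  Vectors.dense(3.5, 2.5, 0.0),
  Vectors.dense(7.0, 5.9, 0.0)
)))

// `names` is a plain local Array[String], so it can be shipped with the closure.
val namedRdd = mat.columnSimilarities().entries.map {
  case MatrixEntry(i, j, v) => (names(i.toInt), names(j.toInt), v)
}

// Hypothetical output column names, chosen only for this sketch.
val named = spark.createDataFrame(namedRdd).toDF("item_a", "item_b", "similarity")
named.show(truncate = false)
```

Note that columnSimilarities() returns only the upper triangle of the similarity matrix, so each pair of columns appears once (with the lower index first) rather than in both directions.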
