Spark, Scala, DataFrame: create feature vectors


Problem Description

I have a DataFrame that looks like the following:

userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2

The number of distinct categories is 10, and I would like to create a feature vector for each userID and fill the missing categories with zeros.

So the output would be something like:

userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
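The core of the desired transformation, looking each category up in a per-user frequency map and defaulting to zero, can be sketched with plain Scala collections (the names `user1` and `allCategories` are illustrative, taken from the sample data above, not from any library):

```scala
// Hypothetical per-user frequency map, mirroring userID 1 in the sample data
val user1 = Map("cat1" -> 1L, "cat2" -> 3L, "cat9" -> 5L)

// Fixed, ordered list of all 10 category names: cat1 .. cat10
val allCategories = (1 to 10).map(n => s"cat$n")

// Categories the user does not have default to 0
val feature = allCategories.map(c => user1.getOrElse(c, 0L))
// feature: Vector(1, 3, 0, 0, 0, 0, 0, 0, 5, 0)
```

The same `getOrElse(_, 0L)` idea is what the accepted answer below applies per user, just distributed over an RDD.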

This is just an illustrative example; in reality I have about 200,000 unique userIDs and 300 unique categories.

What is the most efficient way to create the features DataFrame?

Answer

Assuming:

val cs: SparkContext
val sc: SQLContext
val cats: DataFrame

Where userId and frequency are bigint columns, which correspond to scala.Long.

First we create an intermediate mapping RDD:

val catMaps = cats.rdd
  .groupBy(_.getAs[Long]("userId"))
  .map { case (id, rows) => id -> rows
    .map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
    .toMap
  }
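On the sample data, the same grouping can be traced with plain Scala collections (tuples stand in for the DataFrame rows, so this runs without Spark):

```scala
// (userID, category, frequency) tuples mirroring the sample rows
val rows = Seq(
  (1L, "cat1", 1L), (1L, "cat2", 3L), (1L, "cat9", 5L),
  (2L, "cat4", 6L), (2L, "cat9", 2L), (2L, "cat10", 1L)
)

// Group by userID, then turn each user's rows into a category -> frequency map
val catMaps = rows
  .groupBy(_._1)
  .map { case (id, rs) => id -> rs.map(r => r._2 -> r._3).toMap }

// catMaps(1L) == Map("cat1" -> 1, "cat2" -> 3, "cat9" -> 5)
```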

Then we collect all the categories present, in lexicographic order:

val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
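Be aware that lexicographic string order is not numeric order: "cat10" sorts before "cat2", which is why the show() output further down does not list the columns in the cat1..cat10 order the question expects. A quick plain-Scala check:

```scala
// Lexicographic sort of category names, as used by the broadcast above
val names = Seq("cat1", "cat2", "cat9", "cat10").sorted
// names: List("cat1", "cat10", "cat2", "cat9") -- "cat10" comes before "cat2"
```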

or create it manually:

val catNames = cs.broadcast((1 to 10).map(n => s"cat$n").toArray)

Finally, we transform the maps into arrays, with 0 for the categories a user does not have:

import sc.implicits._
val catArrays = catMaps
      .map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
      .toDF("userId", "feature")

Now catArrays.show() prints something like:

+------+--------------------+
|userId|             feature|
+------+--------------------+
|     2|[0, 1, 0, 6, 0, 0...|
|     1|[1, 0, 3, 0, 0, 0...|
|     3|[5, 0, 0, 0, 16, ...|
+------+--------------------+

This may not be the most elegant solution for DataFrames, as I am only barely familiar with this area of Spark.

Note that you could also create your catNames manually, to add zeros for the missing cat3, cat5, ...

Also note that otherwise the catMaps RDD is computed twice, so you might want to .persist() it.

