Creating a Random Feature Array in Spark DataFrames

Problem Description

When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. These DataFrames contain a column with an Array.

I would like to generate some random data and union it to the userFactors DataFrame.

Here is my code:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.DataFrame

val df1: DataFrame = Seq((123, 456, 4.0), (123, 789, 5.0), (234, 456, 4.5), (234, 789, 1.0)).toDF("user", "item", "rating")
val model1 = new ALS()
  .setImplicitPrefs(true)
  .fit(df1)

val iF = model1.itemFactors
val uF = model1.userFactors

I then create a random DataFrame using a VectorAssembler with this function:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.rand

def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var inputCols: Array[String] = Array()
  // "0 until rank" yields exactly `rank` feature columns
  // (the original "0 to rank" created rank + 1 of them)
  for (i <- 0 until rank) {
    df_dummy = df_dummy.withColumn("feature" + i, rand())
    inputCols = inputCols :+ ("feature" + i)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select("user", "userFeatures")
}

I then create the DataFrame with new user IDs and add the random vectors and bias:

val usersDf: DataFrame = Seq(567, 678).toDF("user")
val usersFactorsNew: DataFrame = makeNew(usersDf, 20)

The problem arises when I union the two DataFrames.

usersFactorsNew.union(uF) produces the error:

 org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> array<float> at the second column of the second table;;

If I print the schemas, the uF DataFrame has a feature vector of type Array[Float], while the usersFactorsNew DataFrame has a feature vector of type Vector.
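
Printing both schemas makes the mismatch visible (a sketch; the output is reproduced approximately, and the id/features column names are the ones Spark's ALSModel exposes):

uF.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- features: array (nullable = true)
//  |    |-- element: float (containsNull = false)

usersFactorsNew.printSchema()
// root
//  |-- user: integer (nullable = false)
//  |-- userFeatures: vector (nullable = true)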

My question is how to change the type of the Vector to an Array in order to perform the union.

I tried writing this udf with little success:

val toArr: org.apache.spark.ml.linalg.Vector => Array[Double] = _.toArray
val toArrUdf = udf(toArr)

Perhaps the VectorAssembler is not the best option for this task. However, at the moment, it is the only option I have found. I would love to get some recommendations for something better.

Recommended Answer

Instead of creating a dummy dataframe and using a VectorAssembler to generate a random feature vector, you can simply use a UDF directly. The userFactors from the ALS model will return an Array[Float], so the output from the UDF should match that:

import scala.util.Random
import org.apache.spark.sql.functions.{lit, udf}

val createRandomArray = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat)
})

Note that this will give numbers in the interval [0.0, 1.0) (the same as the rand() used in the question); if other numbers are required, modify as needed.

Using a rank of 3 and the usersDf from above:

val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))

will give a dataframe as follows (of course with random feature values)

+----+----------------------------------------------------------+
|user|userFeatures                                              |
+----+----------------------------------------------------------+
|567 |[0.6866711267486822,0.7257031656127676,0.983562255688249] |
|678 |[0.7013908820314967,0.41029552817665327,0.554591149586789]|
+----+----------------------------------------------------------+

It should now be possible to union this dataframe with the uF dataframe.
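
For example (a sketch; it assumes the user/userFeatures column names from above, and renames them only so the result lines up with uF's id/features, since union matches columns by position):

import org.apache.spark.sql.functions.col

// rename so the columns line up positionally with uF (id, features)
val combined = usersFactorsNew
  .select(col("user").as("id"), col("userFeatures").as("features"))
  .union(uF)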

The reason the UDF in the question did not work is that it returns an Array[Double], while the union requires an Array[Float]. It can be fixed with a map(_.toFloat):

import org.apache.spark.sql.functions.udf

val toArr: org.apache.spark.ml.linalg.Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)
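
Applied to the Vector-typed usersFactorsNew produced by makeNew in the question, it would be used roughly like this (a sketch under that assumption):

import org.apache.spark.sql.functions.col

// swap the Vector column for an Array[Float] column so it matches uF's features
val usersFactorsArr = usersFactorsNew
  .withColumn("userFeatures", toArrUdf(col("userFeatures")))

usersFactorsArr.union(uF) // both feature columns are now array<float>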
