Creating a Random Feature Array in Spark DataFrames
Question
When creating an ALS model, we can extract a `userFactors` DataFrame and an `itemFactors` DataFrame. These DataFrames contain a column with an Array.
I would like to generate some random data and union it to the `userFactors` DataFrame.
Here is my code:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.DataFrame
import spark.implicits._

val df1: DataFrame = Seq((123, 456, 4.0), (123, 789, 5.0), (234, 456, 4.5), (234, 789, 1.0)).toDF("user", "item", "rating")

val model1 = new ALS()
  .setImplicitPrefs(true)
  .fit(df1)

val iF = model1.itemFactors
val uF = model1.userFactors
I then create a random DataFrame using a `VectorAssembler` with this function:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.rand

def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var inputCols: Array[String] = Array()
  // Add `rank` random feature columns (note: `0 to rank` would create rank + 1 columns)
  for (i <- 0 until rank) {
    df_dummy = df_dummy.withColumn("feature" + i, rand())
    inputCols = inputCols :+ ("feature" + i)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select("user", "userFeatures")
}
I then create the DataFrame with new user IDs and add the random vectors and bias:
val usersDf: DataFrame = Seq(567, 678).toDF("user")
val usersFactorsNew: DataFrame = makeNew(usersDf, 20)
The problem arises when I union the two DataFrames.
usersFactorsNew.union(uF)
produces the error:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> array<float> at the second column of the second table;;
If I print the schema, the `uF` DataFrame has a feature vector of type `Array[Float]`, and the `usersFactorsNew` DataFrame has a feature vector of type `Vector`.
My question is how to change the type of the `Vector` to an Array in order to perform the union.
I tried writing this `udf` with little success:
val toArr: org.apache.spark.ml.linalg.Vector => Array[Double] = _.toArray
val toArrUdf = udf(toArr)
Perhaps the `VectorAssembler` is not the best option for this task. However, at the moment, it is the only option I have found. I would love to get some recommendations for something better.
Answer
Instead of creating a dummy dataframe and using a `VectorAssembler` to generate a random feature vector, you can simply use a UDF directly. The `userFactors` from the ALS model will return an `Array[Float]`, so the output from the UDF should match that.
import scala.util.Random
import org.apache.spark.sql.functions.{udf, lit}

val createRandomArray = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat)
})
Note that this will give numbers in the interval [0.0, 1.0) (same as the `rand()` used in the question); if other numbers are required, adjust as needed.
Using a rank of 3 and `usersDf`:
val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))
will give a dataframe as follows (of course with random feature values)
+----+----------------------------------------------------------+
|user|userFeatures |
+----+----------------------------------------------------------+
|567 |[0.6866711267486822,0.7257031656127676,0.983562255688249] |
|678 |[0.7013908820314967,0.41029552817665327,0.554591149586789]|
+----+----------------------------------------------------------+
Joining this dataframe with the `uF` dataframe should now be possible.
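As a minimal sketch of that union, assuming the factor DataFrame uses the ALS default column names `id` and `features` (the question's snippets don't show them), the columns can be aligned by name first, since `union` resolves columns by position:

```scala
// Rename the randomly generated columns to match the ALS factor schema.
// The names "id" and "features" are the ALS defaults; adjust if yours differ.
val aligned = usersFactorsNew
  .withColumnRenamed("user", "id")
  .withColumnRenamed("userFeatures", "features")

val allFactors = aligned.union(uF)
```

Note that `union` only checks that the column types at each position are compatible; renaming is just to keep the combined schema readable.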
The reason the UDF did not work should be due to it being an `Array[Double]`, while you need an `Array[Float]` for the union. It should be possible to fix with a `map(_.toFloat)`:
val toArr: org.apache.spark.ml.linalg.Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)
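Applying this corrected UDF is a sketch of the alternative route, keeping the `VectorAssembler` output from the question and converting its `Vector` column in place (the column name `userFeatures` comes from `makeNew` above):

```scala
import org.apache.spark.sql.functions.col

// Convert the Vector column to array<float> so it matches uF's schema.
val converted = usersFactorsNew
  .withColumn("userFeatures", toArrUdf(col("userFeatures")))

// union matches columns by position, so only the types need to agree now.
converted.union(uF)
```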