Spark: convert an RDD[LabeledPoint] to a DataFrame to apply MinMaxScaler, and after scaling get the normalized RDD[LabeledPoint]
Question
I'm using RDD[LabeledPoint] in my code, but now I have to normalize the data using the MinMax method.
I saw that the ml library has a MinMaxScaler, but it works with DataFrames: org.apache.spark.ml.feature.MinMaxScaler.
Because the full code was already written with RDDs, I think I could do the following steps so that nothing else has to change:
- Convert the RDD[LabeledPoint] to a DataFrame
- Apply the MinMaxScaler to the DataFrame
- Convert the DataFrame back to an RDD[LabeledPoint]
The thing is, I don't know how to do it. I don't have column names (but the feature vector in each LabeledPoint has 9 dimensions), and I also couldn't adapt other examples to my case, for instance the code in https://stackoverflow.com/a/36909553/5081366 or "Scaling each column of a dataframe".
Thanks for your help!
Answer
Finally, I was able to answer my own question!
Where allData is an RDD[LabeledPoint]:
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

// import sqlContext.implicits._ must come after the instance is created,
// because the implicits object is defined inside the SQLContext/SparkSession class
val sqlContext = SparkSession
  .builder()
  .appName("Spark In Action")
  .master("local")
  .getOrCreate()
import sqlContext.implicits._

// Create a DataFrame from the RDD[LabeledPoint].
// Assuming mllib LabeledPoints: MinMaxScaler expects ml vectors, so convert the features with asML
val all = allData.map(e => (e.label, e.features.asML))
val df_all = all.toDF("labels", "features")

// Scaler instance with min = 0 and max = 1
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("featuresScaled")
  .setMax(1)
  .setMin(0)

// Scaling
var df_scaled = scaler.fit(df_all).transform(df_all)

// Drop the unscaled column
df_scaled = df_scaled.drop("features")

// Convert the DataFrame back to an RDD[LabeledPoint];
// fromML turns the scaled ml vector back into the mllib vector that LabeledPoint expects
val rdd_scaled = df_scaled.rdd.map(row => LabeledPoint(
  row.getAs[Double]("labels"),
  Vectors.fromML(row.getAs[Vector]("featuresScaled"))
))
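As a quick sanity check (a minimal sketch, assuming rdd_scaled is built as above), you can verify that every scaled feature now lies in [0, 1]:

// Print a few scaled points and the overall feature range after MinMax scaling
rdd_scaled.take(3).foreach(println)
val featureMin = rdd_scaled.map(_.features.toArray.min).min()
val featureMax = rdd_scaled.map(_.features.toArray.max).max()
println(s"feature range after scaling: [$featureMin, $featureMax]") // expected: [0.0, 1.0]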
I hope this will help someone else!