Spark: convert an RDD[LabeledPoint] to a DataFrame to apply MinMaxScaler, and after scaling get the normalized RDD[LabeledPoint]


Question

I'm using RDD[LabeledPoint] in my code, but now I have to normalize the data using the MinMax method.

I saw that the ml library has a MinMaxScaler, but it works with DataFrames: org.apache.spark.ml.feature.MinMaxScaler.
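
For reference, MinMaxScaler rescales each feature column individually to the [min, max] output range: Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min, where E_min and E_max are the minimum and maximum of that feature over the data, and the output range defaults to [0, 1].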

Because the full code was already written with RDDs, I think I could do the following steps so I don't have to change anything else:

  1. Convert the RDD[LabeledPoint] to a DataFrame
  2. Apply the MinMaxScaler to the DataFrame
  3. Convert the DataFrame back to an RDD[LabeledPoint]

The thing is, I don't know how to do it. I don't have column names (but the feature vector in the LabeledPoint has 9 dimensions), and I also couldn't adapt other examples to my case, for instance the code in https://stackoverflow.com/a/36909553/5081366 or Scaling each column of a dataframe.

Thanks for your help!

Answer

Finally, I am able to answer my own question!

Where allData is an RDD[LabeledPoint]:

    // Imports needed by this snippet (Spark 2.x, ml package).
    // This assumes the ml package's LabeledPoint; for the RDD API's
    // mllib LabeledPoint, see the conversion note below the code.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.{LabeledPoint, MinMaxScaler}
    import org.apache.spark.ml.linalg.Vector

    // The implicits object is defined inside the session instance,
    // so the implicits import must come after the session is created
    val spark = SparkSession
      .builder()
      .appName("Spark In Action")
      .master("local")
      .getOrCreate()

    import spark.implicits._

    // Create a DataFrame from the RDD[LabeledPoint], naming the columns
    val all = allData.map(e => (e.label, e.features))
    val df_all = all.toDF("labels", "features")

    // MinMaxScaler instance with the output range [0, 1]
    val scaler = new MinMaxScaler()
      .setInputCol("features")
      .setOutputCol("featuresScaled")
      .setMax(1)
      .setMin(0)

    // Fit the scaler on the data and apply the scaling
    var df_scaled = scaler.fit(df_all).transform(df_all)

    // Drop the unscaled column
    df_scaled = df_scaled.drop("features")

    // Convert the DataFrame back to an RDD[LabeledPoint]
    val rdd_scaled = df_scaled.rdd.map(row => LabeledPoint(
      row.getAs[Double]("labels"),
      row.getAs[Vector]("featuresScaled")
    ))
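
Note that rdd_scaled above holds the ml package's LabeledPoint. If the rest of the code uses the RDD API's org.apache.spark.mllib.regression.LabeledPoint instead, here is a sketch of the extra conversion step, assuming Spark 2.x where Vectors.fromML is available:

    import org.apache.spark.ml.linalg.{Vector => MLVector}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint => MLlibLabeledPoint}

    // Map each scaled ml Vector back to an mllib Vector so the result
    // plugs into the existing RDD-based MLlib code unchanged
    val rdd_scaled_mllib = df_scaled.rdd.map(row => MLlibLabeledPoint(
      row.getAs[Double]("labels"),
      Vectors.fromML(row.getAs[MLVector]("featuresScaled"))
    ))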

I hope this will help someone else!
