Spark: convert an RDD[LabeledPoint] to a DataFrame to apply MinMaxScaler, and after scaling get the normalized RDD[LabeledPoint]
Question
I'm using RDD[LabeledPoint] in my code, but now I have to normalize the data using the MinMax method.
I saw that the ml library has a MinMaxScaler, but it works with DataFrames: org.apache.spark.ml.feature.MinMaxScaler.
Because the full code was already written with RDDs, I think I could do the following steps so that nothing else has to change:
- Convert the RDD[LabeledPoint] to a DataFrame
- Apply the MinMaxScaler to the DataFrame
- Convert the DataFrame back to an RDD[LabeledPoint]
The thing is, I don't know how to do it. I don't have column names (but the feature vector in each LabeledPoint has 9 dimensions), and I also couldn't adapt other examples to my case, for instance the code in https://stackoverflow.com/a/36909553/5081366 or "Scaling each column of a dataframe".
Thanks for your help!
Answer
Finally, I was able to answer my own question!
Where allData is an RDD[LabeledPoint]:
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

// import sqlContext.implicits._ must come after the instance is created,
// because the implicits object is defined inside the SQLContext/SparkSession class
val sqlContext = SparkSession
  .builder()
  .appName("Spark In Action")
  .master("local")
  .getOrCreate()
import sqlContext.implicits._

// Create a DataFrame from the RDD[LabeledPoint].
// Assuming mllib LabeledPoints: MinMaxScaler expects ml vectors, so convert the features with asML
val all = allData.map(e => (e.label, e.features.asML))
val df_all = all.toDF("labels", "features")

// Scaler instance with min = 0 and max = 1
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("featuresScaled")
  .setMax(1)
  .setMin(0)

// Scaling
var df_scaled = scaler.fit(df_all).transform(df_all)

// Drop the unscaled column
df_scaled = df_scaled.drop("features")

// Convert the DataFrame back to an RDD[LabeledPoint];
// fromML turns the scaled ml vector back into the mllib vector that LabeledPoint expects
val rdd_scaled = df_scaled.rdd.map(row => LabeledPoint(
  row.getAs[Double]("labels"),
  Vectors.fromML(row.getAs[Vector]("featuresScaled"))
))
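As a quick sanity check (a minimal sketch, assuming rdd_scaled is built as above), you can verify that every scaled feature now lies in [0, 1]:

// Print a few scaled points and the overall feature range after MinMax scaling
rdd_scaled.take(3).foreach(println)
val featureMin = rdd_scaled.map(_.features.toArray.min).min()
val featureMax = rdd_scaled.map(_.features.toArray.max).max()
println(s"feature range after scaling: [$featureMin, $featureMax]") // expected: [0.0, 1.0]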
I hope this will help someone else!