Spark ML: Data de-normalization


Problem description

I need to de-normalize data that was normalized using the MinMaxScaler method of Spark ML.

I was able to normalize my data following these steps: Spark: convert an RDD[LabeledPoint] to a DataFrame to apply MinMaxScaler, and after scaling get the normalized RDD[LabeledPoint] (as I posted earlier).

For example, the original df had the first two columns and, after scaling, the result was:

+------+--------------------+--------------------+
|labels|            features|      featuresScaled|
+------+--------------------+--------------------+
|   1.0|[6.0,7.0,42.0,1.1...|[1.0,0.2142857142...|
|   1.0|[6.0,18.0,108.0,3...|[1.0,1.0,1.0,1.0,...|
|   1.0|[5.0,7.0,35.0,1.4...|[0.0,0.2142857142...|
|   1.0|[5.0,8.0,40.0,1.6...|[0.0,0.2857142857...|
|   1.0|[6.0,4.0,24.0,0.6...|[1.0,0.0,0.0,0.0,...|
+------+--------------------+--------------------+

The problem is that now I need to do the opposite process: de-normalize.

To do so, I need the min and max values for each feature column inside the features vector, as well as the values to be de-normalized.

To get min and max, I query the fitted MinMaxScaler model as follows:

val df_fitted = scaler.fit(df_all)   // returns a MinMaxScalerModel
val df_fitted_original_min = df_fitted.originalMin   // Vector
val df_fitted_original_max = df_fitted.originalMax   // Vector

df_fitted_original_min: [1.0,1.0,7.0,0.007,0.052,0.062,1.0,1.0,7.0,1.0]
df_fitted_original_max: [804.0,553.0,143993.0,537.0,1.0,1.0,4955.0,28093.0,42821.0,3212.0]
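For reference, MinMaxScaler's forward transform (with the default output range [0, 1]) is X_scaled = (X - min) / (max - min). A quick plain-Scala sanity check using the first feature's reported min (1.0) and max (804.0); the raw value here is hypothetical:

```scala
// Forward MinMaxScaler transform for one feature, default range [0, 1]:
//   X_scaled = (X - min) / (max - min)
val min = 1.0    // first feature's originalMin
val max = 804.0  // first feature's originalMax

def scale(x: Double): Double = (x - min) / (max - min)

val raw = 8.282750780471851  // hypothetical raw value
println(scale(raw))          // ≈ 0.009069428120139292
```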

And, on the other hand, I have a DataFrame like this:

+--------------------+-----+--------------------+--------------------+-----+-----+--------------------+--------------------+--------------------+-----+
|               col_0|col_1|               col_2|               col_3|col_4|col_5|               col_6|               col_7|               col_8|col_9|
+--------------------+-----+--------------------+--------------------+-----+-----+--------------------+--------------------+--------------------+-----+
|0.009069428120139292|  0.0|9.015488712438252E-6|2.150418860440459E-4|  1.0|  1.0|0.001470074844665...|2.205824685144127...|2.780971210319238...|  0.0|
|0.008070826019024355|  0.0|3.379696051366339...|2.389342641479033...|  1.0|  1.0|0.001308210192425627|1.962949264985630...|1.042521123176856...|  0.0|
|0.009774715414895803|  0.0|1.299590589291292...|1.981673063697640...|  1.0|  1.0|0.001584395736407...|2.377361424206848...| 4.00879434193585E-5|  0.0|
|0.009631155146285946|  0.0|1.218569739510422...|2.016021040879828E-4|  1.0|  1.0|0.001561125874539...|2.342445354515269...|3.758872615157643E-5|  0.0|

Now, I need to apply the following equation to get the new values, but I do not know how to do it.

X_original = ( X_scaled * (max - min) ) + min

For each position in the DF, I have to apply this equation with the corresponding max and min values for that column.

For example: the value in the first row and first column of the DF is 0.009069428120139292. In that column, the corresponding min and max values are 1.0 and 804.0. So the de-normalized value is:

X_den = ( 0.009069428120139292 * (804.0 - 1.0) ) + 1.0
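Evaluating this with plain Scala (no Spark needed) confirms the arithmetic:

```scala
val xScaled = 0.009069428120139292
val min = 1.0
val max = 804.0

// X_original = (X_scaled * (max - min)) + min
val xDen = (xScaled * (max - min)) + min
println(xDen)  // ≈ 8.28275
```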

It is necessary to clarify that the DF that was normalized in the first place was modified during the program's execution. That is why I need to apply the de-normalization (otherwise, the easiest approach would be to keep a copy of the original DF).

Recommended answer

I got the answer from the following one: https://stackoverflow.com/a/50314767/9759150. With a slight adaptation to my problem, I completed the de-normalization process.

Let's consider normalized_df as the DataFrame with 10 columns (shown in my question):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// X_original = (X_scaled * (max - min)) + min
// min/max are kept as Double: truncating them to Int would corrupt
// fractional bounds such as 0.007.
val updateFunction = (columnValue: Column, minValue: Double, maxValue: Double) =>
    (columnValue * (lit(maxValue) - lit(minValue))) + lit(minValue)

val updateColumns = (df: DataFrame, minVector: Vector, maxVector: Vector,
                     updateFunction: (Column, Double, Double) => Column) => {
    val columns = df.columns
    minVector.toArray.zipWithIndex.map {
      case (minValue, index) =>
        updateFunction(col(columns(index)), minValue, maxVector(index)).as(columns(index))
    }
}

val dfUpdated = normalized_df.select(
  updateColumns(normalized_df, df_fitted_original_min, df_fitted_original_max, updateFunction): _*
)
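The same per-column inversion can be sketched without a Spark session using plain Scala arrays; the scaled row below is hypothetical, and the min/max values are the first three entries of the fitted model's vectors:

```scala
// Spark-free sketch of the column-wise de-normalization:
//   X_original = X_scaled * (max - min) + min, applied per column.
val mins = Array(1.0, 1.0, 7.0)          // per-column originalMin (first three)
val maxs = Array(804.0, 553.0, 143993.0) // per-column originalMax (first three)
val scaledRow = Array(0.5, 0.0, 1.0)     // hypothetical scaled values

val denormalized = scaledRow.zipWithIndex.map { case (x, i) =>
  x * (maxs(i) - mins(i)) + mins(i)
}
println(denormalized.mkString(", "))  // 402.5, 1.0, 143993.0
```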
