Spark RFormula Interpretation


Problem description

I was reading "Spark: The Definitive Guide" and came across a code section in the MLlib chapter with the following code:

var df = spark.read.json("/data/simple-ml") 
df.orderBy("value2").show()
import org.apache.spark.ml.feature.RFormula
// Unable to understand the interpretation of this formula
val supervised = new RFormula().setFormula("lab ~ . + color:value1 + color:value2")
val fittedRF = supervised.fit(df)
val preparedDF = fittedRF.transform(df) 
preparedDF.show()

where /data/simple-ml contains a JSON file with records such as:

"lab":"good","color":"green","value1":1,"value2":14.386294994851129
"lab":"bad","color":"blue","value1":8,"value2":14.386294994851129
"lab":"bad","color":"blue","value1":12,"value2":14.386294994851129
"lab":"good","color":"green","value1":15,"value2":38.9718713375581

You can find the complete data set at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml/part-r-00000-f5c243b9-a015-4a3b-a4a8-eca00f80f04c.json, and the above lines produce the following output:

[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]

Now I am not able to understand how the value of the 5th column (the sparse vector) is calculated.

Answer

The 5th column is a structure representing a sparse vector in Spark. It has three components:

  • the vector length - in this case all vectors are 10 elements long
  • an index array holding the indices of the non-zero elements
  • a value array holding the values of the non-zero elements

So

(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])

represents the following sparse vector of length 10 (take the i-th value and place it at position i):

 0       2    3                   4          7
[1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]

(the positions of the non-zero elements are shown)
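
To make the layout concrete, here is a minimal sketch that rebuilds the same sparse vector by hand with Spark's Vectors factory and expands it to its dense form:

import org.apache.spark.ml.linalg.Vectors

// Rebuild the sparse vector shown above and print its dense layout
val sv = Vectors.sparse(
  10,                                    // vector length
  Array(0, 2, 3, 4, 7),                  // indices of the non-zero elements
  Array(1.0, 1.0, 14.386294994851129, 1.0, 14.386294994851129) // their values
)
println(sv.toDense)
// [1.0,0.0,1.0,14.386294994851129,1.0,0.0,0.0,14.386294994851129,0.0,0.0]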

What are the individual components of that vector? According to the documentation:

RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.

lab ~ . + color:value1 + color:value2 is a special syntax that comes from the R language. It describes a model that regresses the value of lab on all the other features plus two interaction (product) terms. You can see the list of all features by printing fittedRF and looking at the ResolvedRFormula instance it contains:

scala> println(fittedRF)
RFormulaModel(
 ResolvedRFormula(
  label=lab,
  terms=[color,value1,value2,{color,value1},{color,value2}],
  hasIntercept=true
 )
) (uid=rFormula_0847e597e817)

I've split the output into lines and indented it for readability. So . + color:value1 + color:value2 expands to [color,value1,value2,{color,value1},{color,value2}]. Of those, color is a categorical feature and it gets one-hot encoded into a set of indicator features using the following mapping:

  • green becomes [1, 0]
  • blue becomes [0, 0]
  • red becomes [0, 1]

Although you have three categories, only two are used for the encoding. Blue in this case gets dropped since its presence carries no information: if it were kept, the three columns would always sum to 1, which makes them linearly dependent. The effect of dropping the blue category is that it becomes the baseline absorbed into the intercept, and the fitted model predicts what effect changing the category from blue to green or from blue to red has on the label. This particular choice of encoding is somewhat arbitrary; on my system the columns for red and green came out swapped.
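
If you want to see this encoding in isolation, a rough sketch with StringIndexer and OneHotEncoder reproduces the same behaviour (assuming Spark 3.x, where OneHotEncoder is an estimator with a single-column API; older versions differ). Note that the category-to-index assignment depends on category frequencies, so the exact column order may differ on your system:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Index the color strings, then one-hot encode the index.
// dropLast is true by default, so one category becomes the all-zero baseline.
val indexed = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIdx")
  .fit(df)
  .transform(df)

val encoded = new OneHotEncoder()
  .setInputCol("colorIdx")
  .setOutputCol("colorVec")
  .fit(indexed)
  .transform(indexed)

encoded.select("color", "colorVec").distinct().show()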

value1 and value2 are doubles, so they go into the feature vector unchanged. {color,value1} is the product of the color feature and the value1 feature, i.e. the product of the one-hot encoding of color with the scalar value1, which results in three new features. Notice that in this case we cannot drop one category, because the interaction makes the "base" value depend on the value of the second feature in the interaction. The same holds for {color,value2}. So you end up with 2 + 1 + 1 + 3 + 3 = 10 features in total. What you see in the output of show() is the assembled vector feature column, which can be used as input by other Spark ML classes.
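
As a sanity check, here is a small plain-Scala sketch (just arithmetic, no Spark) that assembles the 10 features by hand for the first row (color = green, value1 = 1, value2 = 14.386294994851129), using the column order from the output above:

// One-hot encodings for green under the mapping shown above
val colorMain  = Array(1.0, 0.0)        // 2 columns: baseline category dropped
val colorInter = Array(1.0, 0.0, 0.0)   // 3 columns: all categories kept for interactions
val value1 = 1.0
val value2 = 14.386294994851129

val features =
  colorMain ++                          // color
  Array(value1, value2) ++              // value1, value2
  colorInter.map(_ * value1) ++         // color x value1
  colorInter.map(_ * value2)            // color x value2

println(features.mkString("[", ", ", "]"))
// [1.0, 0.0, 1.0, 14.386294994851129, 1.0, 0.0, 0.0, 14.386294994851129, 0.0, 0.0]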

Here is how the first row is read:

(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])

[1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]
 |--1--| |2|  |-------3--------|  |---4---|  |----------5-----------|

which contains the following individual components:

  1. [1.0, 0, ...] - color, the one-hot encoding (minus the linearly dependent third category) of the category green
  2. [..., 1.0, ...] - value1, value 1
  3. [..., 14.386294994851129, ...] - value2, value 14.38629...
  4. [..., 1.0, 0, 0, ...] - color x value1 interaction term, the product of the one-hot encoding of green ([1, 0, 0]) and 1
  5. [..., 14.386294994851129, 0, 0] - color x value2 interaction term, the product of the one-hot encoding of green ([1, 0, 0]) and 14.38629...
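
To inspect these vectors yourself, you can expand the assembled column into its dense form; RFormula writes it to a column named "features" by default, so a rough snippet like the following prints one dense vector per row:

import org.apache.spark.ml.linalg.Vector

// Print each assembled feature vector in dense form for easier reading
preparedDF.select("features")
  .collect()
  .foreach(row => println(row.getAs[Vector](0).toDense))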
