Spark RFormula Interpretation
Question
I was reading "Spark: The Definitive Guide" and came across a code section in the MLlib chapter that contains the following code:
var df = spark.read.json("/data/simple-ml")
df.orderBy("value2").show()
import org.apache.spark.ml.feature.RFormula
// Unable to understand the interpretation of this formula
val supervised = new RFormula().setFormula("lab ~ . + color:value1 + color:value2")
val fittedRF = supervised.fit(df)
val preparedDF = fittedRF.transform(df)
preparedDF.show()
where /data/simple-ml contains a JSON file with records like:
{"lab":"good","color":"green","value1":1,"value2":14.386294994851129}
{"lab":"bad","color":"blue","value1":8,"value2":14.386294994851129}
{"lab":"bad","color":"blue","value1":12,"value2":14.386294994851129}
{"lab":"good","color":"green","value1":15,"value2":38.9718713375581}
You can find the complete data set at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml/part-r-00000-f5c243b9-a015-4a3b-a4a8-eca00f80f04c.json and the above lines produce the following output:
[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]
Now I am not able to understand how it is calculating the 5th column value (the sparse vector in each row).
Answer
The 5th column is a structure representing sparse vectors in Spark. It has three components:
- vector length - in this case all vectors are 10 elements long
- an array of indices holding the positions of the non-zero elements
- an array of the values of the non-zero elements
So
(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])
represents the following sparse vector of length 10 (take the i-th value and place it at position index i):
 0       2    3                   4          7
[1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]
(the positions of the non-zero elements are shown above the vector)
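The expansion can be sketched in plain Scala (a toy helper, not Spark's actual implementation; Spark represents the same structure with its SparseVector class):

```scala
// A toy sketch (not Spark's implementation) of how a sparse vector
// (size, indices, values) expands to its dense form.
def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)                            // start with all zeros
  indices.zip(values).foreach { case (i, v) => dense(i) = v }  // place each value at its index
  dense
}

val dense = toDense(10, Array(0, 2, 3, 4, 7),
  Array(1.0, 1.0, 14.386294994851129, 1.0, 14.386294994851129))
// dense is [1.0, 0.0, 1.0, 14.386294994851129, 1.0, 0.0, 0.0, 14.386294994851129, 0.0, 0.0]
```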
What are the individual components of that vector? According to the RFormula documentation:
RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
lab ~ . + color:value1 + color:value2

is a special syntax that comes from the R language. It describes a model that regresses the value of lab on all the other features plus two interaction (product) terms. You can see the list of all features by printing fittedRF and looking at the ResolvedRFormula instance it contains:
scala> println(fittedRF)
RFormulaModel(
ResolvedRFormula(
label=lab,
terms=[color,value1,value2,{color,value1},{color,value2}],
hasIntercept=true
)
) (uid=rFormula_0847e597e817)
I've split the output into lines and indented it for readability. So . + color:value1 + color:value2 expands to [color,value1,value2,{color,value1},{color,value2}]. Of those, color is a categorical feature and it gets one-hot encoded into a set of indicator features using the following mapping:
- green becomes [1, 0]
- blue becomes [0, 0]
- red becomes [0, 1]
Although you have three categories, only two are used for the encoding. Blue in this case gets dropped, since its presence carries no information value - if it were kept, all three columns would always sum to 1, which makes them linearly dependent. The effect of dropping the blue category is that it becomes the baseline, absorbed into the intercept, and the fitted model predicts what effect changing the category from blue to green or from blue to red will have on the label. That particular choice of encoding is a bit arbitrary - on my system the columns for red and green came out swapped.
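This mapping can be sketched in plain Scala (the helper name is made up; Spark performs this encoding internally inside RFormula):

```scala
// Hypothetical helper mirroring the mapping above: "blue" is the dropped
// baseline, so it maps to all zeros in the two remaining indicator columns.
def encodeColor(color: String): Array[Double] = color match {
  case "green" => Array(1.0, 0.0)
  case "red"   => Array(0.0, 1.0)
  case "blue"  => Array(0.0, 0.0)  // baseline: absorbed into the intercept
  case other   => sys.error(s"unknown category: $other")
}
```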
value1 and value2 are doubles, so they go unchanged into the feature vector. {color,value1} is the product of the color feature and the value1 feature, that is, the product of the one-hot encoding of color with the scalar value1, resulting in three new features. Notice that in this case we cannot drop one category, because the interaction makes the "base" value dependent on the value of the second feature in the interaction. The same goes for {color,value2}. So you end up with 2 + 1 + 1 + 3 + 3 = 10 features in total. What you see in the output of show() is the assembled vector feature column that can be used as input by other Spark ML classes.
Here is how to read the first row:
(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])

is

[1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]
|--1---||-2-||--------3---------||----4----||-----------5------------|
which contains the following individual components:
- [1.0, 0, ...] - color, one-hot encoding (minus the linearly dependent third category) of the category green
- [..., 1.0, ...] - value1, value 1
- [..., 14.386294994851129, ...] - value2, value 14.38629...
- [..., 1.0, 0, 0, ...] - color x value1 interaction term, product of the one-hot encoding of green ([1, 0, 0]) and 1
- [..., 14.386294994851129, 0, 0] - color x value2 interaction term, product of the one-hot encoding of green ([1, 0, 0]) and 14.38629...