Applying IndexToString to features vector in Spark
Problem description
Context: I have a data frame where all categorical values have been indexed using StringIndexer.
val categoricalColumns = df.schema.collect { case StructField(name, StringType, _, _) => name }
val categoryIndexers = categoricalColumns.map {
  col => new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed")
}
Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).
val assembler = new VectorAssembler()
  .setInputCols(dfIndexed.columns.diff(List("label") ++ categoricalColumns))
  .setOutputCol("features")
After applying the classifier and a few additional steps I end up with a data frame that has label, features, and prediction columns. I would like to expand my features vector into separate columns in order to convert the indexed values back to their original String form.
val categoryConverters = categoricalColumns.zip(categoryIndexers).map { case (col, indexer) =>
  new IndexToString().setInputCol(s"${col}Indexed").setOutputCol(col)
    .setLabels(indexer.fit(df).labels)  // refits each indexer on df to recover its labels
}
Question: Is there a simple way of doing this, or is the best approach to somehow attach the prediction column to the test data frame?
What I have tried:
val featureSlicers = categoricalColumns.map {
  col => new VectorSlicer().setInputCol("features").setOutputCol(s"${col}Indexed").setNames(Array(s"${col}Indexed"))
}
Applying this gives me the columns that I want, but they are of type Vector (which is what VectorSlicer is meant to produce) rather than Double.
The desired output is the original data frame (i.e. categorical features as String not index) with an additional column indicating the predicted label (which in my case is 0 or 1).
For example, say the output of my classifier looked something like this:
+-----+---------+----------+
|label| features|prediction|
+-----+---------+----------+
|  1.0|[0.0,3.0]|       1.0|
+-----+---------+----------+
By applying VectorSlicer on each feature I would get:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        [0.0]|        [3.0]|
+-----+---------+----------+-------------+-------------+
Which is great, but I need:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|          0.0|          3.0|
+-----+---------+----------+-------------+-------------+
To then be able to use IndexToString and convert it to:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|       status|       artist|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|         good|   Pink Floyd|
+-----+---------+----------+-------------+-------------+
Or even:
+-----+----------+-------------+-------------+
|label|prediction|       status|       artist|
+-----+----------+-------------+-------------+
|  1.0|       1.0|         good|   Pink Floyd|
+-----+----------+-------------+-------------+
Recommended answer
Well, it is not a very useful operation, but it should be possible to extract the required information using column metadata and a simple UDF. I assume your data has been created by a pipeline similar to this one:
import org.apache.spark.ml.feature.{VectorSlicer, VectorAssembler, StringIndexer}
import org.apache.spark.ml.Pipeline

val df = sc.parallelize(Seq(
  (1L, "a", "foo", 1.0), (2L, "b", "bar", 2.0), (3L, "a", "bar", 3.0)
)).toDF("id", "x1", "x2", "x3")

val featureCols = Array("x1", "x2", "x3")
val featureColsIdx = featureCols.map(c => s"${c}_i")

val indexers = featureCols.map(
  c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_i")
)

val assembler = new VectorAssembler()
  .setInputCols(featureColsIdx)
  .setOutputCol("features")

val slicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("string_features")
  .setNames(featureColsIdx.init)

val transformed = new Pipeline()
  .setStages(indexers :+ assembler :+ slicer)
  .fit(df)
  .transform(df)
First we can extract desired metadata from the features:
val meta = transformed.select($"string_features")
  .schema.fields.head.metadata
  .getMetadata("ml_attr")
  .getMetadata("attrs")
  .getMetadataArray("nominal")
and convert it to something easier to use:
case class NominalMetadataWrapper(idx: Long, name: String, vals: Array[String])

// In general it could be a good idea to make it a broadcast variable
val lookup = meta.map(m => NominalMetadataWrapper(
  m.getLong("idx"), m.getString("name"), m.getStringArray("vals")
))
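To make the role of this lookup concrete before it goes into a UDF, here is a Spark-free sketch of the decoding it enables for a single row. The `Nominal` case class, the `sampleLookup` value and the label arrays are made up for illustration (they reuse the status/artist example from the question); only the indexing logic mirrors what the lookup above is used for.

```scala
import scala.util.Try

// Hypothetical stand-in for NominalMetadataWrapper, kept Spark-free.
final case class Nominal(idx: Int, name: String, vals: Array[String])

// Assumed sample metadata: slot 0 holds the status index, slot 1 the artist index.
val sampleLookup = Seq(
  Nominal(0, "statusIndexed", Array("good", "bad")),
  Nominal(1, "artistIndexed", Array("Queen", "Genesis", "Yes", "Pink Floyd"))
)

// Decode one row: for every nominal slot, read the index stored at that
// position and look up its label. An index without a label yields None.
def decode(row: Array[Double]): Seq[Option[String]] =
  sampleLookup.map(m => Try(m.vals(row(m.idx).toInt)).toOption)

val decoded = decode(Array(0.0, 3.0))   // both indices have labels
val unknown = decode(Array(5.0, 3.0))   // 5 is out of range for status
```

Wrapping the array access in `Try` means an unexpected index produces `None` for that slot instead of failing the whole row.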
Finally a small UDF:
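import scala.util.Try
// Spark 2.x and later; on Spark 1.x use org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// For each nominal slot, look up the label of the index stored in the vector
val transFeatures = udf((v: Vector) => lookup.map {
  m => Try(m.vals(v(m.idx.toInt).toInt)).toOption
})

transformed.select(transFeatures($"string_features"))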
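To get closer to the output asked for in the question (plain Double index columns decoded back to strings), one possible route is a second UDF that reads a single element of the vector, followed by IndexToString with labels taken from the fitted StringIndexerModel. The following is only a sketch, assuming Spark 2.x or later (`SparkSession`, `org.apache.spark.ml.linalg.Vector`); the names `cats`, `models` and `elementAt` and the toy data are illustrative and not part of the original answer.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, StringIndexerModel, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}

val spark = SparkSession.builder().master("local[*]").appName("decode-features").getOrCreate()
import spark.implicits._

// Toy data (assumed): an id plus two categorical columns.
val df = Seq((1L, "a", "foo"), (2L, "b", "bar"), (3L, "a", "bar")).toDF("id", "x1", "x2")
val cats = Array("x1", "x2")

// Fit one StringIndexerModel per categorical column and keep it for its labels.
val models: Array[StringIndexerModel] = cats.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_i").fit(df)
}
val indexed = models.foldLeft(df)((acc, m) => m.transform(acc))

// Keep only id and the assembled vector, mimicking a classifier output
// in which the individual index columns are no longer available.
val assembled = new VectorAssembler()
  .setInputCols(cats.map(c => s"${c}_i"))
  .setOutputCol("features")
  .transform(indexed)
  .select("id", "features")

// UDF that pulls a single Double out of the vector at position i.
val elementAt = udf((v: Vector, i: Int) => v(i))

// One Double column per categorical feature, in assembler input order.
val withDoubles = cats.zipWithIndex.foldLeft(assembled) { case (acc, (c, i)) =>
  acc.withColumn(s"${c}_i", elementAt(col("features"), lit(i)))
}

// The UDF output carries no ML metadata, so the labels are set explicitly.
val decoded = cats.zip(models).foldLeft(withDoubles) { case (acc, (c, m)) =>
  new IndexToString().setInputCol(s"${c}_i").setOutputCol(c).setLabels(m.labels).transform(acc)
}

decoded.select("id", "x1", "x2").show()
```

The positions passed to `elementAt` rely on the order of the assembler's input columns, so this only works when that order is known. On Spark 3.0 and later, `org.apache.spark.ml.functions.vector_to_array` may be a simpler replacement for the element UDF, and `labelsArray` is the non-deprecated accessor for the labels.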