Spark ML Categorical Features
Question
How do I handle categorical data in Spark ML (not MLlib)? Though the documentation is not very clear, it seems that classifiers (e.g. RandomForestClassifier, LogisticRegression) have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame.
Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol. However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.
How should I proceed?
Recommended answer
I just wanted to complete Holden's answer.
Since Spark 1.4.0, Spark ML also provides a OneHotEncoder feature transformer (org.apache.spark.ml.feature.OneHotEncoder).
One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
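To make the mapping concrete, here is a minimal plain-Scala sketch of the idea. It is an illustration only, not Spark's implementation; the names OneHotSketch, oneHot and oneHotDropLast are made up for this example. The "drop last" variant is included because Spark's OneHotEncoder drops the last category by default (its dropLast parameter), which is why a vector can contain no one-value at all.

```scala
// Plain-Scala illustration of one-hot encoding; not the Spark implementation.
// All names here (OneHotSketch, oneHot, oneHotDropLast) are hypothetical.
object OneHotSketch {
  // Encode a category index as a binary vector of length numCategories.
  def oneHot(index: Int, numCategories: Int): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
  }

  // Variant that drops the last position, so the last index maps to all zeros.
  def oneHotDropLast(index: Int, numCategories: Int): Vector[Double] =
    oneHot(index, numCategories).dropRight(1)

  def main(args: Array[String]): Unit = {
    // Print the encoding of each of three category indices.
    (0 until 3).foreach { i =>
      println(s"$i -> ${oneHotDropLast(i, 3).mkString("[", ", ", "]")}")
    }
  }
}
```

With three categories, indices 0 and 1 each get a vector with a single 1.0, and index 2 gets the all-zero vector.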
Consider the following DataFrame:
val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c")).toDF("id", "category")
The first step would be to create the indexed DataFrame with the StringIndexer:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
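By default StringIndexer assigns indices by descending label frequency, so the most frequent label gets index 0. The plain-Scala sketch below mimics that ordering on the example categories. It is only an approximation for illustration: IndexSketch and fitIndex are hypothetical names, the alphabetical tie-break is an assumption of this sketch, and Spark emits the indices as doubles rather than ints.

```scala
// Plain-Scala illustration of frequency-based label indexing; not Spark code.
// IndexSketch and fitIndex are hypothetical names used only for this example.
object IndexSketch {
  // Assign index 0 to the most frequent label, 1 to the next, and so on.
  // Ties are broken alphabetically (an assumption made for this sketch).
  def fitIndex(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  def main(args: Array[String]): Unit = {
    // Same category values as the example DataFrame.
    val categories = Seq("a", "b", "c", "a", "a", "c")
    val index = fitIndex(categories)
    categories.foreach(c => println(s"$c -> ${index(c)}"))
  }
}
```

For the example data, "a" occurs three times, "c" twice and "b" once, so they map to 0, 1 and 2 respectively.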
You can then encode the categoryIndex column with OneHotEncoder:
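import org.apache.spark.ml.feature.OneHotEncoder
// On Spark 1.4-2.x, OneHotEncoder is a Transformer, so transform can be called directly.
// On Spark 3.0+, it is an Estimator: use encoder.fit(indexed).transform(indexed) instead.
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show()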
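To come back to the original question, the encoded vector column can then be passed to VectorAssembler together with numeric columns. The lines below are a sketch that continues the same spark-shell session and assumes the `encoded` DataFrame from the step above. The id column stands in for a real numeric feature purely for illustration, and the output column name "features" is chosen to match the default featuresCol.

```scala
// Continues the spark-shell session above; `encoded` comes from the previous step.
import org.apache.spark.ml.feature.VectorAssembler

// Combine a numeric column and the one-hot vector column into a single vector column.
// "id" is used here only as a stand-in for a real numeric feature.
val assembler = new VectorAssembler().setInputCols(Array("id", "categoryVec")).setOutputCol("features")
val assembled = assembler.transform(encoded)
assembled.select("id", "features").show()
```

The resulting features column can then be used as the featuresCol of a classifier such as RandomForestClassifier or LogisticRegression.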