Spark ML Categorical Features

Question

How do I handle categorical data in Spark ML (not MLlib)? The documentation is not very clear, but it seems that the classifiers (e.g. RandomForestClassifier, LogisticRegression) take a featuresCol argument, which names the DataFrame column that holds the features, and a labelCol argument, which names the DataFrame column that holds the class labels.

Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol. However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?

Recommended answer

I just wanted to complete Holden's answer.

Since Spark 1.4.0, MLlib also supplies a OneHotEncoder feature transformer (in the org.apache.spark.ml.feature package).

One-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
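To make the "at most a single one-value" wording concrete, here is a minimal plain-Scala sketch of the mapping. The helper oneHot is hypothetical and is not Spark's implementation; it assumes the convention of dropping the last category, so that the last index is encoded as the all-zeros vector.

```scala
// Hypothetical helper for illustration only; this is not part of Spark.
def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
  require(index >= 0 && index < numCategories, "index out of range")
  // Assumption: with dropLast, the vector has one slot fewer than there are categories.
  val size = if (dropLast) numCategories - 1 else numCategories
  // The last category then matches no slot and becomes all zeros,
  // which is why a vector holds "at most" a single one.
  Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
}

val encodedIndices = Seq(0, 1, 2).map(i => oneHot(i, numCategories = 3))
// encodedIndices: Vector(1.0, 0.0), Vector(0.0, 1.0), Vector(0.0, 0.0)
```

Dropping one category keeps the encoded columns from summing to one in every row, which matters for linear models such as logistic regression.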

Consider the following DataFrame:

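// toDF needs the SQL implicits: spark-shell imports them for you; elsewhere import spark.implicits._ first.
val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")).toDF("id", "category")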

The first step would be to create the indexed DataFrame with the StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)
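
As a rough illustration of what the indexing step produces for this data, the plain-Scala sketch below assigns indices by descending label frequency, which is StringIndexer's default ordering. It is not Spark code; the names are made up for the example, and the alphabetical tie-break is an assumption (no ties occur in this data).

```scala
val categories = Seq("a", "b", "c", "a", "a", "c")

// Count occurrences of each label, then order labels by descending frequency.
// Assumption: ties are broken alphabetically.
val labelsByFrequency: Seq[String] =
  categories.groupBy(identity).toSeq
    .map { case (label, occurrences) => (label, occurrences.size) }
    .sortBy { case (label, count) => (-count, label) }
    .map(_._1)

// The most frequent label gets index 0.0, the next 1.0, and so on.
val labelToIndex: Map[String, Double] =
  labelsByFrequency.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

val categoryIndex = categories.map(labelToIndex)
// a -> 0.0, c -> 1.0, b -> 2.0
```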

You can then encode categoryIndex with the OneHotEncoder:

import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
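// Written against the Spark 1.4-era API, where OneHotEncoder is a Transformer.
// From Spark 3.0 it is an Estimator: use encoder.fit(indexed).transform(indexed) instead.
val encoded = encoder.transform(indexed)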
encoded.select("id", "categoryVec").show
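
To tie this back to the question, the encoded vector column can then be passed to VectorAssembler together with ordinary numeric columns. The following is a sketch that continues from the encoded DataFrame above; using id as a numeric feature and naming the output column features are assumptions made purely for illustration.

```scala
import org.apache.spark.ml.feature.VectorAssembler

// Assumption: "id" stands in for a real numeric feature; it is only an example column.
val assembler = new VectorAssembler()
  .setInputCols(Array("id", "categoryVec"))
  .setOutputCol("features")

// Concatenates the numeric column and the one-hot vector into a single vector column.
val assembled = assembler.transform(encoded)
assembled.select("features").show(false)
```

The resulting features column is what a classifier's featuresCol would point at, since VectorAssembler accepts vector columns alongside numeric ones.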
