Spark ML Categorical Features

Views: 284  Published: 2016/5/22 15:18:47  Tags: apache-spark categorical-data apache-spark-ml apache-spark-mllib

Question

How do I handle categorical data in Spark ML (not MLlib)? Though the documentation is not very clear, it seems that classifiers (e.g. RandomForestClassifier, LogisticRegression) take a featuresCol parameter, which names the DataFrame column holding the feature vectors, and a labelCol parameter, which names the DataFrame column holding the class labels.

Obviously I want to use more than one feature in my prediction, so I tried using VectorAssembler to put all my features into a single vector under featuresCol. However, VectorAssembler only accepts numeric types, the boolean type, and the vector type (according to the Spark website), so I can't put strings in my feature vector.

How should I proceed?

Recommended answer

I just wanted to complete Holden's answer.

Since Spark 1.4.0, MLlib also supplies a OneHotEncoder feature transformer, in the org.apache.spark.ml.feature package.

One-hot encoding maps a column of label indices to a column of binary vectors, each with at most a single one-value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features.
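To make the mapping concrete, here is a minimal plain-Scala sketch of one-hot encoding a label index, with no Spark involved. The object name `OneHotSketch`, the helper `oneHot`, and its `dropLast` flag are illustrative names chosen for this sketch. The flag mimics the "at most a single one-value" behaviour described above, under the assumption that the last category is represented by the all-zeros vector.

```scala
// Plain-Scala illustration of one-hot encoding; this is not Spark's implementation.
object OneHotSketch {

  /** Encode a category index as a binary vector.
    * With dropLast = true the last category maps to the all-zeros vector,
    * so every vector has at most one 1.0 and length numCategories - 1.
    */
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean = true): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Three categories with indices 0, 1 and 2.
    (0 until 3).foreach { i =>
      println(s"$i -> ${oneHot(i, 3).mkString("[", ", ", "]")}")
    }
  }
}
```

With three categories, indices 0 and 1 each get a vector containing a single 1.0, while index 2 gets the all-zeros vector. Dropping the last category this way avoids a redundant column whose value is implied by the others.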

Consider the following DataFrame:

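// toDF needs the SQL implicits: spark-shell imports them automatically;
// in an application, import spark.implicits._ (sqlContext.implicits._ on Spark 1.x) first.
val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")).toDF("id", "category")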

The first step is to create the indexed DataFrame with the StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

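// Fit on df to learn the string-to-index mapping for the "category" column.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
// Append the numeric "categoryIndex" column.
val indexed = indexer.transform(df)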
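As far as I know, StringIndexer by default assigns indices by descending label frequency, so the most frequent label gets 0.0. The following plain-Scala sketch (no Spark involved) imitates that rule on the category values from the example DataFrame. The names `StringIndexSketch` and `buildIndex` are made up for illustration, and the alphabetical tie-break is an assumption of this sketch rather than a statement about Spark.

```scala
// Plain-Scala illustration of frequency-ordered label indexing; this is not Spark's implementation.
object StringIndexSketch {

  /** Assign 0.0 to the most frequent label, 1.0 to the next, and so on.
    * Ties are broken alphabetically, which is an assumption of this sketch.
    */
  def buildIndex(labels: Seq[String]): Map[String, Double] = {
    val counts = labels.groupBy(identity).map { case (label, xs) => (label, xs.size) }
    counts.toSeq
      .sortBy { case (label, count) => (-count, label) } // most frequent first
      .map(_._1)
      .zipWithIndex
      .map { case (label, i) => (label, i.toDouble) }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Same category values as the example DataFrame.
    val categories = Seq("a", "b", "c", "a", "a", "c")
    val index = buildIndex(categories)
    categories.foreach(c => println(s"$c -> ${index(c)}"))
  }
}
```

Here "a" occurs three times, "c" twice and "b" once, so the sketch maps them to 0.0, 1.0 and 2.0 respectively.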

You can then encode the categoryIndex column with OneHotEncoder:

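import org.apache.spark.ml.feature.OneHotEncoder

// Map each category index to a sparse binary vector.
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
// Spark 1.4-2.x: OneHotEncoder is a Transformer. On Spark 3.0+ it is an Estimator,
// so use encoder.fit(indexed).transform(indexed) instead.
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show()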
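To connect this back to the original question, the encoded vector column can be passed to VectorAssembler together with numeric columns to build the single column that featuresCol expects. The sketch below assumes Spark 2.0 or later (for SparkSession) and adds a hypothetical numeric column named "amount" that is not part of the original example. The object and method names are also illustrative. The stages are wrapped in a Pipeline so that the same code should work whether OneHotEncoder is a Transformer (older releases) or an Estimator (Spark 3.0+).

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object CategoricalPipelineSketch {

  /** Index and one-hot encode "category", then assemble it with the
    * hypothetical numeric column "amount" into a single "features" vector.
    */
  def assemble(df: DataFrame): DataFrame = {
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val assembler = new VectorAssembler()
      .setInputCols(Array("categoryVec", "amount"))
      .setOutputCol("features")

    // Pipeline.fit fits each estimator stage and passes transformers through.
    val stages: Array[PipelineStage] = Array(indexer, encoder, assembler)
    new Pipeline().setStages(stages).fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.0+ session is acceptable for this demo.
    val spark = SparkSession.builder().master("local[*]").appName("categorical-sketch").getOrCreate()
    import spark.implicits._

    // Toy data: the categories from the example plus made-up "amount" values.
    val df = Seq((0, "a", 1.0), (1, "b", 2.5), (2, "c", 0.5), (3, "a", 3.0), (4, "a", 1.5), (5, "c", 2.0))
      .toDF("id", "category", "amount")

    assemble(df).select("id", "features").show(false)
    spark.stop()
  }
}
```

The resulting "features" column can then be handed to a classifier such as RandomForestClassifier through setFeaturesCol("features"), alongside a label column given through setLabelCol.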
