How to transform a categorical variable in Spark into a set of columns coded as {0,1}?

Problem description

I'm trying to perform a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (with Scala) on a dataset which contains categorical variables. I discovered that Spark cannot work with that kind of variable.

In R there is a simple way to deal with this kind of problem: I convert the variable to a factor (categories), and R creates a set of columns coded as {0,1} indicator variables.

How can I do this with Spark?

Solution

If I understood correctly, you do not want to convert one categorical column into several dummy columns. You want Spark to understand that the column is categorical and not numerical.

I think it depends on the algorithm you want to use. For example, RandomForest and GBT both take categoricalFeaturesInfo as a parameter; see:

https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$

So, for example:

categoricalFeaturesInfo = Map[Int, Int]((1,2),(2,5))

says that the second column of your features (indices start at 0, so 1 is the second column) is categorical with 2 levels, and the third column is also a categorical feature, with 5 levels. You can specify these parameters when you train your RandomForest or GBT.
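To make the meaning of that map concrete, here is a minimal plain-Scala sketch (no Spark needed) that checks a feature vector against it. The helper name `matchesInfo` and the sample vectors are hypothetical; the check relies on MLlib's convention that a categorical feature with k levels takes the values 0, 1, ..., k-1.

```scala
// Feature 1 has 2 levels, feature 2 has 5 levels (the map from the text).
val categoricalFeaturesInfo = Map[Int, Int]((1, 2), (2, 5))

// Hypothetical helper: true when every categorical feature holds a whole
// number in the range 0 until its declared number of levels.
def matchesInfo(features: Array[Double], info: Map[Int, Int]): Boolean =
  info.forall { case (idx, arity) =>
    val v = features(idx)
    v == math.floor(v) && v >= 0 && v < arity
  }

// Feature 0 is numerical and is not checked.
val ok = matchesInfo(Array(3.7, 1.0, 4.0), categoricalFeaturesInfo)
// Feature 1 holds 2.0, which is outside its 2 declared levels (0 and 1).
val bad = matchesInfo(Array(3.7, 2.0, 4.0), categoricalFeaturesInfo)
```

A check like this could be run over a sample of rows before training, since values outside the declared range are a sign that the levels were not mapped as expected.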

You need to make sure your levels are mapped to 0, 1, 2, ... so if you have something like ("good", "medium", "bad"), map it to (0, 1, 2).
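As a rough illustration of that mapping, the plain-Scala sketch below builds an index for the three example levels and applies it to a few values. The names `levelIndex`, `raw` and `encoded` and the sample data are made up for this example; on a real DataFrame the same lookup would have to be applied to the column itself.

```scala
// Levels from the example in the text, in the order they should be numbered.
val levels = Seq("good", "medium", "bad")

// "good" -> 0, "medium" -> 1, "bad" -> 2
val levelIndex: Map[String, Int] = levels.zipWithIndex.toMap

// Hypothetical raw values of one categorical column.
val raw = Seq("bad", "good", "medium", "good")

// Replace each level name with its numeric code.
val encoded = raw.map(levelIndex)
```

Fixing the order of `levels` by hand keeps the codes stable between runs, which matters if the same encoding has to be applied to both training and test data.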

Now, in your case you want to use LogisticRegressionWithLBFGS. In this case my suggestion is to actually transform the categorical columns into dummy columns: for example, a single column with 3 levels ("good", "medium", "bad") becomes 3 columns holding 1 or 0 depending on which level the row has. I do not have an example to work with, so here is some sample code in Scala that should work:

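import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DoubleType

// Returns a function that yields 1.0 when its argument equals `value`, else 0.0.
// Defined before dummygen, which uses it.
val index = (value: Double) => (a: Double) => if (value == a) 1.0 else 0.0

val dummygen = (data: DataFrame, col: Array[String]) => {
  var temp = data
  for (i <- 0 until col.length) {
    // Number of distinct levels in this column (assumed to be coded 0, 1, ..., N-1)
    val N = data.select(col(i)).distinct.count.toInt
    for (j <- 0 until N) {
      // One indicator column per level, named <column>_<level>
      temp = temp.withColumn(col(i) + "_" + j.toString,
        udf(index(j.toDouble)).apply(data(col(i)).cast(DoubleType)))
    }
  }
  temp
}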

You can then call it like this:

val results = dummygen(data, Array("CategoricalColumn1","CategoricalColumn2"))

Here I do it for a list of categorical columns (just in case you have more than one in your feature list). The first for loop goes through each categorical column; the second for loop goes through each level in that column and creates one new column per level.
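To show what the generated columns contain for a single categorical column, here is a small plain-Scala sketch of the same idea without Spark. The function name `dummies` and the sample codes are hypothetical, and it assumes the levels are already coded 0, 1, 2, ...

```scala
// For one level code, build the row of indicator values:
// 1.0 in the position of that level, 0.0 everywhere else.
def dummies(level: Int, numLevels: Int): Array[Double] =
  Array.tabulate(numLevels)(j => if (j == level) 1.0 else 0.0)

// Hypothetical column with 3 levels, holding the codes 0, 2 and 1.
val rows = Seq(0, 2, 1).map(dummies(_, 3))
```

Each entry of `rows` has exactly one 1.0, in the position of the level that row belongs to, which is the {0,1} coding asked about in the question.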

Important: this assumes that you have first mapped the levels to 0, 1, 2, ...

You can then run LogisticRegressionWithLBFGS on this new feature set. This approach also helps with SVM.
