RDD to LabeledPoint conversion

Views: 5272 Published: 2016/5/22 16:02:15 Tags: scala apache-spark rdd spark-dataframe apache-spark-mllib


Question

I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) returns Int = 77, which tells me that my dependent (target) variable is column number 77. However, I don't know how to select only some of the columns as features (say, columns 23 to 59, 111 to 357, and 399 to 489). I am wondering whether I can do something like this:

val data = rdd.map(col => new LabeledPoint(
    col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray)))

Any suggestions or guidance would be much appreciated.

Maybe I have confused RDD with DataFrame. I can convert the RDD to a DataFrame with .toDF(), or perhaps it is easier to accomplish this with a DataFrame than with an RDD.

Recommended answer

I assume your data looks more or less like this:

import scala.util.Random.{setSeed, nextDouble}
setSeed(1)

case class Record(
    foo: Double, target: Double, x1: Double, x2: Double, x3: Double)

val rows = sc.parallelize(
    (1 to 10).map(_ => Record(
        nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
   ))
)
val df = sqlContext.createDataFrame(rows)
df.registerTempTable("df")

sqlContext.sql("""
  SELECT ROUND(foo, 2) foo,
         ROUND(target, 2) target,
         ROUND(x1, 2) x1,
         ROUND(x2, 2) x2,
         ROUND(x2, 2) x3 
  FROM df""").show

So we have the data below (the query above selects ROUND(x2, 2) as x3, so the x3 column shown here repeats x2):

+----+------+----+----+----+
| foo|target|  x1|  x2|  x3|
+----+------+----+----+----+
|0.73|  0.41|0.21|0.33|0.33|
|0.01|  0.96|0.94|0.95|0.95|
| 0.4|  0.35|0.29|0.51|0.51|
|0.77|  0.66|0.16|0.38|0.38|
|0.69|  0.81|0.01|0.52|0.52|
|0.14|  0.48|0.54|0.58|0.58|
|0.62|  0.18|0.01|0.16|0.16|
|0.54|  0.97|0.25|0.39|0.39|
|0.43|  0.23|0.89|0.04|0.04|
|0.66|  0.12|0.65|0.98|0.98|
+----+------+----+----+----+

We want to ignore foo and x2 and extract LabeledPoint(target, Array(x1, x3)):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Map feature names to column indices
val featInd = List("x1", "x3").map(df.columns.indexOf(_))

// Or, if you would rather list the columns to exclude, use this
// definition of featInd instead of the one above:
// val ignored = List("foo", "target", "x2")
// val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))

// Get the index of the target column
val targetInd = df.columns.indexOf("target")

val data = df.rdd.map(r => LabeledPoint(
   r.getDouble(targetInd), // Target value becomes the label
   // Map feature indices to values
   Vectors.dense(featInd.map(r.getDouble(_)).toArray)
))
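
The answer selects features by name, while the question asked about index ranges (23 to 59, 111 to 357, 399 to 489). A minimal sketch of how such ranges might be flattened into a single featInd is shown here in plain Scala, with no Spark dependency. The names featureRanges and row are hypothetical, the row is a stand-in array rather than a Spark Row, and treating the ranges as 0-based and inclusive is an assumption.

```scala
// Hypothetical names; the ranges are the ones mentioned in the question.
// Assumption: indices are 0-based and inclusive, matching columns.indexOf.
val featureRanges = Seq(23 to 59, 111 to 357, 399 to 489)
val targetInd = 77

// Flatten the ranges into one index array and guard against the target
// column slipping into the feature set.
val featInd: Array[Int] = featureRanges.flatten.filter(_ != targetInd).toArray

// Stand-in for one row of about 500 numeric columns
// (each value equals its own column index, for easy checking).
val row: Array[Double] = Array.tabulate(500)(_.toDouble)

// Same pattern as in the answer: one label plus the selected features.
val label = row(targetInd)
val features = featInd.map(row(_))
```

With a DataFrame, the same featInd could presumably be dropped into the answer's mapping, with r.getDouble(_) taking the place of row(_), provided every selected column is actually of type Double.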
