Imputing the dataset with mean of class label causing crash in filter operation

Views: 165 Posted: 2016/5/22 16:48:43 Tags: scala apache-spark


Problem description

I have a CSV file that contains numeric values.

val row = withoutHeader.map { line =>
  val arr = line.split(',')
  for (h <- 0 until arr.length) {
    if (arr(h).trim == "") {
      val abc = avgrdd.filter { case ((x, y), z) => x == h && y == arr(dependent_col_index).toDouble } // crashing here
      arr(h) = // imputing with the value above
    }
  }
  arr.mkString(",")
}

This is a snippet of the code where I am trying to impute the missing values with the mean for the row's class label.

avgrdd contains the averages as key-value pairs, where the key is the pair (column index, class label value). avgrdd is computed using combiners, and I can see that it produces the correct results.

dependent_col_index is the index of the column containing the class labels.

The line with the filter crashes with a NullPointerException. If I remove this line, the output is the original array (comma separated).

I am confused about why the filter operation is causing a crash.

Please suggest how to fix this issue.

Example

col1,dependent_col_index
4,1
8,0
 ,1
21,1
21,0
 ,1
25,1
 ,0
34,1

The mean for class 1 is 84/4 = 21, and the mean for class 0 is 29/2 = 14.5.
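The arithmetic above can be reproduced with a small sketch in plain Scala (no Spark). The object name `ClassMeans`, the method `compute`, and the parameter `labelCol` are hypothetical names chosen for illustration; the sketch assumes every row has a numeric class label and that blank cells mark missing values.

```scala
object ClassMeans {
  // Returns a map from (column index, class label) to the mean of the
  // non-empty values of that column among rows carrying that label.
  // Assumption: labelCol is the index of the class-label column.
  def compute(rows: Seq[String], labelCol: Int): Map[(Int, Double), Double] = {
    val cells = rows.flatMap { row =>
      val arr = row.split(",", -1) // -1 keeps trailing empty cells
      val label = arr(labelCol).trim.toDouble
      arr.toSeq.zipWithIndex.collect {
        // Skip the label column itself and any missing (blank) cells.
        case (value, col) if col != labelCol && value.trim.nonEmpty =>
          ((col, label), value.trim.toDouble)
      }
    }
    cells.groupBy(_._1).map { case (key, vs) =>
      key -> vs.map(_._2).sum / vs.size
    }
  }
}
```

For the sample data, the key `(0, 1.0)` collects the values 4, 21, 25 and 34, and the key `(0, 0.0)` collects 8 and 21, which matches the sums 84 and 29 used above.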

Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
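One possible way to get this output is to look the means up in an ordinary in-memory map instead of filtering an RDD inside another RDD's `map`; Spark does not support referencing one RDD from within a transformation of another, which is a likely cause of the NullPointerException. The following is a minimal sketch in plain Scala, not a definitive implementation. `ClassMeanImputer`, `imputeLine`, `fmt` and `labelCol` are hypothetical names, and the sketch assumes the means are keyed by (column index, class label) as described for avgrdd.

```scala
object ClassMeanImputer {
  // Hypothetical formatting helper: renders 21.0 as "21" and 14.5 as "14.5"
  // so the result matches the style of the required output.
  private def fmt(d: Double): String =
    if (d == d.toLong) d.toLong.toString else d.toString

  // Replaces every blank cell of a CSV line with the mean stored under
  // (column index, class label of this row). Cells with no matching
  // mean are left unchanged.
  def imputeLine(
      line: String,
      means: collection.Map[(Int, Double), Double],
      labelCol: Int): String = {
    val arr = line.split(",", -1) // -1 keeps trailing empty cells
    val label = arr(labelCol).trim.toDouble
    arr.zipWithIndex.map { case (value, col) =>
      if (value.trim.isEmpty) means.get((col, label)).map(fmt).getOrElse(value)
      else value
    }.mkString(",")
  }
}
```

In the Spark job, and assuming avgrdd has type `RDD[((Int, Double), Double)]` and is small enough to fit on the driver, the map could be built with `avgrdd.collectAsMap()`, shared through `sc.broadcast(...)`, and then used as `withoutHeader.map(line => ClassMeanImputer.imputeLine(line, meansBc.value, dependent_col_index))`, where `meansBc` is a hypothetical name for the broadcast variable.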

Thanks!
