Imputing the dataset with mean of class label causing crash in filter operation


Problem description

I have a csv file that contains numeric values.

val row = withoutHeader.map { line =>
  val arr = line.split(',')
  for (h <- 0 until arr.length) {
    if (arr(h).trim == "") {
      val abc = avgrdd.filter { case ((x, y), z) => x == h && y == arr(dependent_col_index).toDouble } // crashing here
      arr(h) = // imputing with the value above
    }
  }
  arr.mkString(",")
}

This is a snippet of the code where I am trying to impute the missing values with the mean of class labels.

avgrdd contains the averages as key-value pairs, where the key is the pair (column index, class label value). avgrdd is calculated using combiners, and I can see that it produces the correct results.

dependent_col_index is the column containing the class labels.

The line with the filter crashes with a null pointer exception. If I remove this line, the output is the original array (comma separated).

I am confused why the filter operation is causing a crash.

Please suggest how to fix this issue.

Example

col1,dependent_col_index
4,1
8,0
 ,1
21,1
21,0
 ,1
25,1
 ,0
34,1

The mean for class 1 is 84/4 = 21 and for class 0 is 29/2 = 14.5.

Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1

Thanks!

Recommended answer

You are trying to execute an RDD transformation inside another RDD transformation. Remember that you cannot use an RDD inside a transformation of another RDD: the function passed to map runs on the executors, where avgrdd cannot be used, and this causes the error.

Proceed as follows:

  1. Transform the source RDD withoutHeader into an RDD of pairs <Class, Value> of the correct type (Long in your case). Cache it.
  2. Calculate avgrdd on top of withoutHeader. This should be an RDD of pairs <Class, AvgValue>.
  3. Join the withoutHeader RDD and avgrdd together. This way, for each row you will have a structure <Class, <Value, AvgValue>>.
  4. Execute map on top of the result to replace each missing Value with AvgValue.
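
The four steps can be sketched roughly as follows. This is only an illustration under several assumptions: the input has exactly two columns (value, class label) as in the example data, a missing value is a blank field, values are parsed as Double rather than Long because the class means can be fractional (14.5 in the example), and the names ImputeByClassMean, toPairs and impute are made up for this sketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ImputeByClassMean {
  // Hypothetical helper: turns "value,label" lines into <Class, Value> pairs.
  // The value is kept as a trimmed string; "" marks a missing value.
  def toPairs(lines: RDD[String]): RDD[(String, String)] =
    lines.map { line =>
      val arr = line.split(",", -1).map(_.trim)
      (arr(1), arr(0))
    }

  def impute(lines: RDD[String]): RDD[String] = {
    // Step 1: pair RDD <Class, Value>, cached because it is used twice.
    val pairs = toPairs(lines).cache()

    // Step 2: per-class mean of the non-missing values -> <Class, AvgValue>.
    val avgrdd: RDD[(String, Double)] = pairs
      .filter { case (_, value) => value.nonEmpty }
      .mapValues(_.toDouble)
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }

    // Step 3: join -> <Class, <Value, AvgValue>>.
    // Step 4: replace each missing Value with the AvgValue of its class.
    pairs.join(avgrdd).map { case (label, (value, avg)) =>
      val filled = if (value.isEmpty) avg.toString else value
      s"$filled,$label"
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("impute-by-class-mean").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val withoutHeader = sc.parallelize(
      Seq("4,1", "8,0", " ,1", "21,1", "21,0", " ,1", "25,1", " ,0", "34,1"))
    impute(withoutHeader).collect().foreach(println)
    sc.stop()
  }
}
```

Two caveats about this sketch: the join does not preserve the original row order, and because it is an inner join, rows whose class has no non-missing values at all are dropped. Imputed values are printed with Double formatting (21.0 rather than 21). Extending this to several columns would mean keying by (column index, class label), as avgrdd in the question already does.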

Another option might be to split the RDD into two parts at step 3 (one RDD with the missing values, another with the non-missing values), join avgrdd only with the RDD containing the missing values, and after that make a union of these two parts. This would be faster if only a small fraction of the values is missing.
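
A minimal sketch of this split-and-union variant, assuming pairs is the <Class, Value> RDD with an empty string marking a missing value and avgrdd is the <Class, AvgValue> RDD; the object and method names are hypothetical.

```scala
import org.apache.spark.rdd.RDD

object ImputeSplitUnion {
  // pairs:  <Class, Value>, where "" marks a missing value (assumption).
  // avgrdd: <Class, AvgValue>.
  def impute(pairs: RDD[(String, String)],
             avgrdd: RDD[(String, Double)]): RDD[(String, String)] = {
    // Split into rows with and without a value.
    val missing = pairs.filter { case (_, value) => value.isEmpty }
    val present = pairs.filter { case (_, value) => value.nonEmpty }

    // Only the (hopefully small) missing part goes through the join.
    val filled = missing.join(avgrdd).map { case (label, (_, avg)) =>
      (label, avg.toString)
    }

    // Rows that already had a value pass through untouched.
    present.union(filled)
  }
}
```

Since pairs is filtered twice here, caching it beforehand (as in step 1) avoids recomputing it. As with the full join, the result does not keep the original row order.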
