Apache Spark: dealing with Option/Some/None in RDDs


Question

I'm mapping over an HBase table, generating one RDD element per HBase row. However, sometimes the row has bad data (throwing a NullPointerException in the parsing code), in which case I just want to skip it.

I have my initial mapper return an Option to indicate that it returns 0 or 1 elements, then filter for Some, then get the contained value:

// myRDD is RDD[(ImmutableBytesWritable, Result)]
val output = myRDD.
  map( tuple => getData(tuple._2) ).
  filter( {case Some(y) => true; case None => false} ).
  map( _.get ).
  // ... more RDD operations with the good data

def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L

  try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions

    Some( ( id, ( List(x),
          // more stuff ...
        ) ) )
  } catch {
    case e: NullPointerException => {
      logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e)
      None
    }
  }
}

Is there a more idiomatic way to do this that's shorter? I feel like this looks pretty messy, both in getData() and in the map.filter.map dance I'm doing.

Perhaps a flatMap could work (generate 0 or 1 items in a Seq), but I don't want it to flatten the tuples I'm creating in the map function, just eliminate empties.
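(To see why the flattening worry is unfounded: `flatMap` over `Option` only unwraps the `Option` layer itself, never the tuple inside it. A minimal sketch with plain Scala collections, not Spark, using hypothetical placeholder data:)

```scala
// Standalone sketch (plain Scala collections, not an RDD) showing that
// flatMap over Option removes the Nones without flattening inner tuples.
object FlatMapOptionSketch extends App {
  // Hypothetical parse results: Some(tuple) for good rows, None for bad ones.
  val parsed: Seq[Option[(String, List[Long])]] =
    Seq(Some(("row1", List(1L))), None, Some(("row2", List(2L))))

  // Option is implicitly viewed as a 0-or-1-element collection here,
  // so flatMap drops each None and unwraps each Some, leaving the
  // (String, List[Long]) tuples intact.
  val good: Seq[(String, List[Long])] = parsed.flatMap(identity)

  println(good) // the two tuples, untouched; the None is gone
}
```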

Answer

If you change your getData to return a scala.util.Try then you can simplify your transformations considerably. Something like this could work:

def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L

  val tr = util.Try{
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions

    ( id, ( List(x)
          // more stuff ...
     ) )
  } 

  tr.failed.foreach(e => logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e))
  tr
}

Then your transform could start like so:

myRDD.
  flatMap(tuple => getData(tuple._2).toOption)

If your Try is a Failure, it will be turned into a None via toOption and then removed as part of the flatMap logic. At that point, the next step in your transform will be working only with the successful cases, i.e. whatever underlying type getData returns, without the Option wrapper.
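(The Try-to-Option conversion that makes this work can be checked in isolation. A small sketch, independent of Spark and HBase, with a deliberately throwing body standing in for a bad row:)

```scala
import scala.util.Try

object TryToOptionSketch extends App {
  // A successful Try becomes Some of the computed value...
  val ok: Option[(String, List[Long])] =
    Try(("id1", List(42L))).toOption

  // ...while a Try whose body throws (here a stand-in for a bad HBase row)
  // becomes a Failure, which toOption collapses to None.
  val bad: Option[(String, List[Long])] =
    Try[(String, List[Long])](throw new NullPointerException).toOption

  println(ok)  // Some((id1,List(42)))
  println(bad) // None
}
```

So `flatMap(tuple => getData(tuple._2).toOption)` drops the failed rows and leaves only the unwrapped tuples, in one step.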

