Apache Spark: dealing with Option/Some/None in RDDs
Question
I'm mapping over an HBase table, generating one RDD element per HBase row. However, sometimes the row has bad data (throwing a NullPointerException in the parsing code), in which case I just want to skip it.
I have my initial mapper return an Option to indicate that it returns 0 or 1 elements, then filter for Some, then get the contained value:
// myRDD is RDD[(ImmutableBytesWritable, Result)]
val output = myRDD.
  map( tuple => getData(tuple._2) ).
  filter( { case Some(y) => true; case None => false } ).
  map( _.get ).
  // ... more RDD operations with the good data
def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    Some( ( id, ( List(x),
      // more stuff ...
    ) ) )
  } catch {
    case e: NullPointerException => {
      logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e)
      None
    }
  }
}
Is there a more idiomatic way to do this that's shorter? I feel like this looks pretty messy, both in getData() and in the map.filter.map dance I'm doing.
Perhaps a flatMap could work (generate 0 or 1 items in a Seq), but I don't want it to flatten the tuples I'm creating in the map function, just eliminate empties.
Answer
If you change your getData to return a scala.util.Try then you can simplify your transformations considerably. Something like this could work:
def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  val tr = util.Try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    ( id, ( List(x)
      // more stuff ...
    ) )
  }
  tr.failed.foreach(e => logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e))
  tr
}
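The key property of Try used above is that it captures any non-fatal exception as a Failure instead of throwing, and toOption then maps Success to Some and Failure to None. A small self-contained sketch, where parse is a hypothetical stand-in for the parsing work inside getData:

```scala
import scala.util.Try

// Hypothetical stand-in for the parsing logic in getData.
def parse(s: String): Try[Long] = Try(s.toLong)

val ok  = parse("42").toOption   // Some(42L)
val bad = parse("oops").toOption // None: the NumberFormatException became a Failure
```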
Then your transform could start like so:
myRDD.
  flatMap(tuple => getData(tuple._2).toOption)
If your Try is a Failure, toOption turns it into a None, which is then removed as part of the flatMap logic. From that point on, the rest of the transformation works only with the successful cases, i.e. whatever underlying type getData returns, without the wrapping (no Option).
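Putting the pieces together on a plain Scala collection (a stand-in for the RDD, since the shape of flatMap is the same), with a hypothetical parseRow playing the role of getData over made-up "id:value" strings:

```scala
import scala.util.Try

// Hypothetical stand-in for getData: parse an "id:value" string.
def parseRow(raw: String): Try[(String, Long)] =
  Try {
    val Array(id, num) = raw.split(":")
    (id, num.toLong)
  }

val rows = List("a:1", "garbage", "b:2")

// Same shape as myRDD.flatMap(tuple => getData(tuple._2).toOption):
// the malformed row becomes a Failure, then a None, then vanishes.
val good = rows.flatMap(r => parseRow(r).toOption)
// good == List(("a", 1L), ("b", 2L))
```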