What is the equivalent to scala.util.Try in pyspark?


Question

I've got a lousy HTTPD access_log and just want to skip the "lousy" lines.

In Scala this is simple:

import scala.util.Try

val log = sc.textFile("access_log")

log.map(_.split(' ')).map(a => Try(a(8))).filter(_.isSuccess).map(_.get).map(code => (code,1)).reduceByKey(_ + _).collect()

For Python I've got the following solution, explicitly defining a function instead of using the lambda notation:

log = sc.textFile("access_log")

def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'

log.map(lambda s : s.split(' ')).map(wrapException).filter(lambda s : s!='error').map(lambda code : (code,1)).reduceByKey(lambda acu,value : acu + value).collect()
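
The pipeline above can be sanity-checked locally without a Spark cluster, with plain Python lists standing in for the RDD (the log lines below are invented for illustration):

```python
# Local check of the workaround above; plain lists stand in for the RDD.
# The log lines are invented: field 8 holds the HTTP status code.
lines = ["a b c d e f g h 200", "a b c d e f g h 404", "oops"]

def wrapException(a):
    try:
        return a[8]
    except:
        return 'error'

parsed = [wrapException(s.split(' ')) for s in lines]
codes = [code for code in parsed if code != 'error']
print(parsed)  # the malformed line is mapped to the sentinel 'error'
print(codes)   # and then filtered out
```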

Is there a better way of doing this (e.g. like in Scala) in pyspark?

Thanks a lot!

Answer

Better is a subjective term, but there are a few approaches you can try.


  • Probably the simplest thing you can do in this particular case is to avoid exceptions altogether. All you need is a flatMap and some slicing:

log.flatMap(lambda s : s.split(' ')[8:9])

As you can see it means no need for exception handling or a subsequent filter.
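
Why the slice makes this safe can be checked without Spark (the log lines here are invented): an out-of-range slice on a short list yields `[]` rather than raising, and flattening drops those empty results.

```python
# Slicing never raises: [8:9] on a list shorter than 9 elements gives [],
# so malformed lines simply contribute nothing to the flattened result.
lines = ["a b c d e f g h 200 extra", "short line"]
codes = [c for line in lines for c in line.split(' ')[8:9]]
print(codes)  # only the well-formed line contributes a value
```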

The previous idea can be extended with a simple wrapper:

def seq_try(f, *args, **kwargs):
    # Singleton list on success, empty list on failure.
    try:
        return [f(*args, **kwargs)]
    except Exception:
        return []

and an example usage:

from operator import truediv as div  # operator.div is Python 2 only; FYI operator provides getitem as well.

rdd = sc.parallelize([1, 2, 0, 3, 0, 5, "foo"])

rdd.flatMap(lambda x: seq_try(div, 1., x)).collect()
## [1.0, 0.5, 0.3333333333333333, 0.2]
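
Brought back to the original access-log question, the same wrapper combines with `operator.getitem` like this (sketched with plain Python lists standing in for the RDD; the log lines are invented):

```python
from operator import getitem

def seq_try(f, *args, **kwargs):
    # One-element list on success, empty list on failure,
    # so flattening silently drops the lines that fail to parse.
    try:
        return [f(*args, **kwargs)]
    except Exception:
        return []

# Plain lists stand in for rdd.flatMap; field 8 holds the status code.
lines = ["a b c d e f g h 200", "bad line"]
codes = [c for line in lines for c in seq_try(getitem, line.split(' '), 8)]
print(codes)  # the malformed line is dropped without any explicit filter
```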


  • Finally, a more OO approach:

    import inspect as _inspect
    
    class _Try(object): pass    
    
    class Failure(_Try):
        def __init__(self, e):
            if Exception not in _inspect.getmro(e.__class__):
                msg = "Invalid type for Failure: {0}"
                raise TypeError(msg.format(e.__class__))
            self._e = e
            self.isSuccess = False
            self.isFailure = True
    
        def get(self): raise self._e
    
        def __repr__(self):
            return "Failure({0})".format(repr(self._e))
    
    class Success(_Try):
        def __init__(self, v):
            self._v = v
            self.isSuccess = True
            self.isFailure = False
    
        def get(self): return self._v
    
        def __repr__(self):
            return "Success({0})".format(repr(self._v))
    
    def Try(f, *args, **kwargs):
        try:
            return Success(f(*args, **kwargs))
        except Exception as e:
            return Failure(e)
    

    and an example usage:

    tries = rdd.map(lambda x: Try(div, 1.0, x))
    tries.collect()
    ## [Success(1.0),
    ##  Success(0.5),
    ##  Failure(ZeroDivisionError('float division by zero',)),
    ##  Success(0.3333333333333333),
    ##  Failure(ZeroDivisionError('float division by zero',)),
    ##  Success(0.2),
    ##  Failure(TypeError("unsupported operand type(s) for /: 'float' and 'str'",))]
    
    tries.filter(lambda x: x.isSuccess).map(lambda x: x.get()).collect()
    ## [1.0, 0.5, 0.3333333333333333, 0.2]
    

    You can even use pattern matching with multipledispatch:

    from multipledispatch import dispatch
    from operator import getitem
    
    @dispatch(Success)
    def check(x): return "Another great success"
    
    @dispatch(Failure)
    def check(x): return "What a failure"
    
    a_list = [1, 2, 3]
    
    check(Try(getitem, a_list, 1))
    ## 'Another great success'
    
    check(Try(getitem, a_list, 10)) 
    ## 'What a failure'
    

    If you like this approach I've pushed a little bit more complete implementation to GitHub and to PyPI (https://pypi.python.org/pypi/tryingsnake/0.2.5).
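
Putting the pieces together, the Scala one-liner from the question can be mirrored with the `Try` helper. This is a plain-Python sketch: compact stand-ins for the `Success`/`Failure`/`Try` definitions above, lists in place of the RDD, invented log lines, and `Counter` in place of `reduceByKey`:

```python
from collections import Counter
from operator import getitem

# Minimal stand-ins for the Success/Failure/Try classes defined above.
class Success(object):
    isSuccess = True
    def __init__(self, v): self._v = v
    def get(self): return self._v

class Failure(object):
    isSuccess = False
    def __init__(self, e): self._e = e
    def get(self): raise self._e

def Try(f, *args, **kwargs):
    try:
        return Success(f(*args, **kwargs))
    except Exception as e:
        return Failure(e)

# map -> Try -> filter isSuccess -> get -> count, like the Scala pipeline.
lines = ["a b c d e f g h 200", "a b c d e f g h 404",
         "a b c d e f g h 200", "garbage"]
tries = [Try(getitem, line.split(' '), 8) for line in lines]
counts = Counter(t.get() for t in tries if t.isSuccess)
print(counts)  # status-code counts; the garbage line becomes a Failure
```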

