Pyspark Invalid Input Exception try except error

Question

I am trying to read the last 4 months of data from s3 using pyspark and process it, but I am receiving the following exception.


org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3://path_to_clickstream/date=201508*

On the first day of each month there is no entry yet in the s3 path (a separate job processes and uploads the data to that path, and my job runs before it), so the job fails. Is there a way to catch this exception and let the job continue processing all the paths that do exist?

Recommended answer

You can simply try to trigger a cheap action just after the load and catch Py4JJavaError:

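from py4j.protocol import Py4JJavaError

# Assumes an existing SparkContext named sc, as in the pyspark shell.
def try_load(path):
    rdd = sc.textFile(path)
    try:
        # textFile is lazy, so run a cheap action to force the path to be resolved.
        # take(1), unlike first(), does not raise on an existing but empty input.
        rdd.take(1)
        return rdd
    except Py4JJavaError:
        # The input could not be read (for example, the pattern matched no files).
        return sc.emptyRDD()

rdd = try_load(s3_path)
if not rdd.isEmpty():
    run_the_rest_of_your_code(rdd)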
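
Note that catching Py4JJavaError swallows every Java-side failure, not only a missing input. Below is a minimal sketch of a narrower check. It assumes the Java exception's class name appears in the text of the Python error, as it does in the message quoted in the question, which is not guaranteed in general. The function name, the `load` parameter, and the `marker` parameter are hypothetical and exist only for illustration.

```python
def try_load_missing_ok(load, path, marker="InvalidInputException"):
    """Load `path` with `load`; return None if the input is reported missing.

    `load` is any callable that returns an object with a take() method;
    in a Spark job it would be sc.textFile. All other errors are re-raised.
    """
    try:
        rdd = load(path)
        rdd.take(1)  # cheap action that forces the input path to be resolved
        return rdd
    except Exception as error:
        # Assumption: the Java exception class name shows up in the error text.
        if marker in str(error):
            return None
        raise
```

In a job this might be called as try_load_missing_ok(sc.textFile, s3_path), with the caller skipping any path for which None comes back.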

Edit

If you want to handle multiple paths, you can process each one separately and combine the results:

paths = [
    "s3://path_to_inputdir/month1*/",
    "s3://path_to_inputdir/month2*/",
    "s3://path_to_inputdir/month3*/"]

rdds = sc.union([try_load(path) for path in paths])
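
The paths list above is hard-coded. For the layout in the question (date=YYYYMM* under s3://path_to_clickstream), the patterns for the most recent months could be generated instead. This is only a sketch: the function name is hypothetical, and it assumes that "last 4 months" includes the current month.

```python
from datetime import date

def month_patterns(today, months, base="s3://path_to_clickstream"):
    """Return glob patterns for the `months` most recent months, oldest first."""
    patterns = []
    year, month = today.year, today.month
    for _ in range(months):
        patterns.append("{}/date={:04d}{:02d}*".format(base, year, month))
        # Step back one month, rolling over into December of the previous year.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return patterns[::-1]

paths = month_patterns(date.today(), 4)
```

The resulting list can be passed through try_load and sc.union in the same way as the hard-coded one.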

If you want finer control, you can list the contents of the input location and load only the files that are known to exist.
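
A rough sketch of that idea, leaving open how the listing is obtained: the `listed` values and directory names below are made up for illustration, and the helper name is hypothetical. Python's fnmatch stands in for the filesystem's own glob matching here, and the two are not identical (for instance, fnmatch lets * match across "/").

```python
from fnmatch import fnmatchcase

def select_existing(listed, patterns):
    """Return the listed paths that match at least one pattern, in listing order."""
    return [path for path in listed
            if any(fnmatchcase(path, pattern) for pattern in patterns)]

# Hypothetical listing result; in practice it would come from your S3 tooling.
listed = [
    "s3://path_to_inputdir/month1a/",
    "s3://path_to_inputdir/month3a/",
    "s3://path_to_inputdir/other/",
]
patterns = [
    "s3://path_to_inputdir/month1*/",
    "s3://path_to_inputdir/month2*/",
    "s3://path_to_inputdir/month3*/",
]

existing = select_existing(listed, patterns)
# textFile accepts several input paths as one comma-separated string.
input_paths = ",".join(existing)
```

If `existing` is non-empty, sc.textFile(input_paths) then reads only paths that are known to be present, so no exception handling is needed for missing months.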

If at least one of these paths is non-empty, you should be able to make things even simpler and use a glob like this:

sc.textFile("s3://path_to_inputdir/month[1-3]*/")
