Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?


Question

I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']
df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like for sparkSql to load as many of the files as it finds into the dataframe, and return this result without complaining. Is this possible?

Answer

Yes, it's possible if you change the method of specifying the input to a Hadoop glob pattern, for example:

files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)

You can read more about glob patterns in the Hadoop javadoc.
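
As a rough sketch of what such a pattern can look like (reusing the hypothetical s3a://dev layout from the question), a wildcard expands only to the day directories that actually exist:

# '*' matches every existing day directory under 2017/01/
files = 's3a://dev/2017/01/*/data.parquet'
df = session.read.parquet(files)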

But, in my opinion, this isn't an elegant way of working with data partitioned by time (by day in your case). If you are able to rename the directories like this (one way to do the rename is sketched after the list):

  • s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
  • s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet
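
S3 has no true rename operation, so one way to get this layout is to copy each object to the new day=... prefix and delete the original. Below is a minimal sketch, assuming a bucket named dev (taken from the s3a://dev/... paths above) and boto3 credentials already configured; the prefixes are illustrative only:

import boto3

s3 = boto3.client('s3')
bucket = 'dev'  # assumed bucket name, from the s3a://dev/... paths above

# Map each old day prefix to its partitioned replacement (illustrative values)
renames = {'2017/01/02/': 'day=2017-01-02/',
           '2017/01/03/': 'day=2017-01-03/'}

for old_prefix, new_prefix in renames.items():
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=old_prefix)
    for obj in listing.get('Contents', []):
        old_key = obj['Key']
        new_key = new_prefix + old_key[len(old_prefix):]
        # copy the object to its new key, then remove the old one
        s3.copy_object(Bucket=bucket,
                       CopySource={'Bucket': bucket, 'Key': old_key},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=old_key)

Note that list_objects_v2 returns at most 1000 keys per call, so a large prefix would need pagination (for example via get_paginator('list_objects_v2')).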

then you can take advantage of Spark's partition discovery and read the data with:

from pyspark.sql.functions import col

session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))

This way will omit empty/non-existing directories as well. An additional column day will appear in your dataframe (it will be a string in Spark < 2.1.0 and a datetime in Spark >= 2.1.0), so you will know which directory each record came from.
