Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?


Question

I would like to read multiple parquet files from S3 into a single dataframe. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']
df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to request a list of files to be loaded into a dataframe without it breaking when some of the files in the list don't exist. In other words, I would like Spark SQL to load as many of the files as it finds into the dataframe, and return that result without complaining. Is this possible?

Answer

Yes, it's possible if you change the method of specifying the input to a Hadoop glob pattern, for example:

files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)
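
If the days to load start out as a Python list (as in the question), the brace portion of the glob can be assembled from that list. A minimal sketch, assuming all days fall under the same year/month prefix and that session is the SparkSession from the question:

# Hypothetical list of day-of-month strings, all under 2017/01
days = ['02', '03']

# Builds 's3a://dev/2017/01/{02,03}/data.parquet'
files = 's3a://dev/2017/01/{' + ','.join(days) + '}/data.parquet'
df = session.read.parquet(files)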

But, in my opinion, this isn't an elegant way of working with data partitioned by time (by day, in your case). If you are able to rename the directories like this:

  • s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
  • s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet

Then you can take advantage of Spark's partition discovery and read the data with:

from pyspark.sql.functions import col

df = session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))

This approach will also skip empty and non-existing directories. An additional column day will appear in your dataframe (it will be a string in Spark < 2.1.0 and a datetime in Spark >= 2.1.0), so you will know which directory each record came from.
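
If you also control the job that writes this data, that directory layout is exactly what a partitioned write produces. A minimal sketch (the day column and output path here are assumptions, not something given in the question):

# Hypothetical: partitionBy('day') creates the day=YYYY-MM-DD/ subdirectories automatically
df.write.partitionBy('day').parquet('s3a://dev/')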

