Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?


Question

I would like to read multiple parquet files from S3 into a single dataframe. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']
df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to request a list of files to be loaded into a dataframe without it breaking when some of the files in the list don't exist. In other words, I would like Spark SQL to load as many of the files as it finds into the dataframe, and return that result without complaining. Is this possible?

Answer

Yes, it's possible if you change the method of specifying the input to a Hadoop glob pattern, for example:

files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)
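
If the days to load start out as a Python list (as in the question), the brace portion of the glob can be assembled from that list. A minimal sketch, assuming all days fall under the same year/month prefix and that session is the SparkSession from the question:

# Hypothetical list of day-of-month strings, all under 2017/01
days = ['02', '03']

# Builds 's3a://dev/2017/01/{02,03}/data.parquet'
files = 's3a://dev/2017/01/{' + ','.join(days) + '}/data.parquet'
df = session.read.parquet(files)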

But, in my opinion, this isn't an elegant way of working with data partitioned by time (by day, in your case). If you are able to rename the directories like this:

  • s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
  • s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet

Then you can take advantage of Spark's partition discovery and read the data with:

from pyspark.sql.functions import col

df = session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))

This approach will also skip empty and non-existing directories. An additional column day will appear in your dataframe (it will be a string in Spark < 2.1.0 and a datetime in Spark >= 2.1.0), so you will know which directory each record came from.
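
If you also control the job that writes this data, that directory layout is exactly what a partitioned write produces. A minimal sketch (the day column and output path here are assumptions, not something given in the question):

# Hypothetical: partitionBy('day') creates the day=YYYY-MM-DD/ subdirectories automatically
df.write.partitionBy('day').parquet('s3a://dev/')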

