Reading Multiple S3 Folders / Paths Into PySpark

Question

I am conducting a big data analysis using PySpark. I am able to import all CSV files, stored in a particular folder of a particular bucket, using the following command:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///home/path/datafolder/data2014/*.csv')

(where * acts as a wildcard)

The issues I am running into are as follows:

  1. What if I want to do my analysis on 2014 and 2015 data, i.e. file 1 is .load('file:///home/path/SFweather/data2014/*.csv'), file 2 is .load('file:///home/path/SFweather/data2015/*.csv'), file 3 is .load('file:///home/path/NYCweather/data2014/*.csv') and file 4 is .load('file:///home/path/NYCweather/data2015/*.csv')? How do I import multiple paths at the same time to get one dataframe? Do I need to store them all individually as dataframes and then join them together within PySpark? (You may assume all the CSVs have the same schema.)
  2. Suppose it is November 2014 now. What if I want to run the analysis again, but on the most recent data, e.g. dec14 once it is December 2014? For example, I would load .load('file:///home/path/datafolder/data2014/dec14/*.csv') in December 2014, having used .load('file:///home/path/datafolder/data2014/nov14/*.csv') for the original analysis. Is there a way to schedule the Jupyter notebook (or similar) to update the load path and import the latest run ('nov14' would be replaced by 'dec14', then 'jan15', and so on)?

I had a look through the previous questions but was unable to find an answer, given that this is specific to the AWS / PySpark integration.

Thanks in advance for your help!

[Context: I have access to many S3 buckets, belonging to various teams, containing a variety of big data sets. Copying the data over into my own S3 bucket and then building a Jupyter notebook seems like a lot more work than pulling the data directly from their buckets, building my models/tables/etc. on top of it, and saving the processed output to a database. Hence the question above. If my thinking is completely wrong, please stop me! :)]

Answer

You can read in multiple paths with wildcards, as long as the files are all in the same format.

In your example:

.load('file:///home/path/SFweather/data2014/*.csv')
.load('file:///home/path/SFweather/data2015/*.csv')
.load('file:///home/path/NYCweather/data2014/*.csv')
.load('file:///home/path/NYCweather/data2015/*.csv')

You could replace the four load statements above with the following path, which reads all of the CSVs into one dataframe at once:

.load('file:///home/path/*/*/*.csv')
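
If that wildcard sweeps in sibling folders you don't want, another option is to pass an explicit list of paths, since DataFrameReader.load accepts a list of path strings in recent PySpark releases (whether your Spark version supports this is an assumption worth checking). A minimal sketch, reusing the reader setup and paths from the question:

# Explicit list of the four folders; only these are read.
paths = [
    'file:///home/path/SFweather/data2014/*.csv',
    'file:///home/path/SFweather/data2015/*.csv',
    'file:///home/path/NYCweather/data2014/*.csv',
    'file:///home/path/NYCweather/data2015/*.csv',
]

# load() with a list of paths returns a single dataframe spanning
# all four folders (all CSVs assumed to share one schema).
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load(paths)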

If you want to be more specific, in order to avoid reading in certain files/folders, you can do the following:

.load('file:///home/path/{SF,NYC}weather/data201[45]/*.csv')

(Spark resolves these paths with Hadoop-style globbing: {SF,NYC} is alternation between the two folder names, while a bracket expression such as [45] matches a single character. A pattern like [SF|NYC] would be a single-character class and would not match either folder name.)
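
As for the second question (always loading the most recent month), one approach is to derive the folder name from the run date instead of hard-coding it, so that a scheduled re-run of the notebook picks up the latest folder automatically. A minimal sketch, assuming the datafolder/data<year>/<mon><yy> layout from the question; the base path is illustrative:

from datetime import date

# Build folder names such as 'data2014' and 'dec14' from today's date.
# Note: %b is locale-dependent; this assumes an English locale.
today = date.today()
year_folder = 'data' + today.strftime('%Y')                         # e.g. 'data2014'
month_folder = today.strftime('%b').lower() + today.strftime('%y')  # e.g. 'dec14'

path = 'file:///home/path/datafolder/{0}/{1}/*.csv'.format(year_folder, month_folder)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load(path)

The notebook itself can then be triggered monthly by cron or a similar scheduler, with no manual edits to the load path.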
