pyspark select subset of files using regex/glob from s3
Question
I have a number of files on Amazon S3, each segregated by date (date=yyyymmdd). The files go back 6 months, but I would like to restrict my script to only use the last 3 months of data. I am unsure whether I can use regular expressions to do something like sc.textFile("s3://path_to_dir/yyyy[m1,m2,m3]*"),
where m1, m2, m3 represent the 3 months back from the current date that I would like to use.
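For concreteness, here is a minimal sketch (the helper name is hypothetical, not from the question) of how those three month prefixes could be computed in plain Python:

from datetime import date

# Hypothetical helper: yyyymm prefixes for the last three calendar
# months, counting backwards from today (inclusive).
def last_three_month_prefixes(today=None):
    today = today or date.today()
    year, month = today.year, today.month
    prefixes = []
    for _ in range(3):
        prefixes.append("%04d%02d" % (year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return prefixes

print(last_three_month_prefixes(date(2015, 2, 10)))  # ['201502', '201501', '201412']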
One discussion also suggested using something like sc.textFile("s3://path_to_dir/yyyym1*", "s3://path_to_dir/yyyym2*", "s3://path_to_dir/yyyym3*"), but that doesn't seem to work for me.
Does sc.textFile() take regular expressions? I know you can use glob expressions, but I am unsure how to represent the above case as a glob expression.
Recommended answer
For your first option, use curly braces:
sc.textFile("s3://path_to_dir/yyyy{m1,m2,m3}*")
For your second option, you can read each single glob into an RDD and then union those RDDs into a single one:
# Read each month's files into its own RDD.
m1 = sc.textFile("s3://path_to_dir/yyyym1*")
m2 = sc.textFile("s3://path_to_dir/yyyym2*")
m3 = sc.textFile("s3://path_to_dir/yyyym3*")

# Union them into a single RDD; `all_months` avoids shadowing Python's built-in all().
all_months = m1.union(m2).union(m3)
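If the number of patterns is not fixed, PySpark's SparkContext.union can combine a whole list of RDDs in one call; a sketch, again reusing the hypothetical prefixes list:

# One RDD per month prefix, then a single union over the whole list.
rdds = [sc.textFile("s3://path_to_dir/%s*" % p) for p in prefixes]
all_months = sc.union(rdds)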
You can use globs with sc.textFile, but not full regular expressions.
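One reason the multi-argument call in the question fails is that sc.textFile takes a single path string. To my knowledge, the Hadoop FileInputFormat that textFile delegates to also accepts a comma-separated list of paths inside that one string, so a variant worth trying (an assumption, not something the answer confirms) is:

# Assumption: comma-separated paths are split by Hadoop's FileInputFormat.
paths = ",".join("s3://path_to_dir/%s*" % p for p in prefixes)
rdd = sc.textFile(paths)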