Using filenames to create variable - PySpark

Problem description

I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.

For instance, for a weekly file I could manually do it for each file as:

from pyspark.sql.functions import lit

# Year and week read off the file name by hand and hard-coded per file
df = (spark.read.load('my_folder/sales_report_2019_12.csv', format="csv")
      .withColumn("sales_year", lit(2019))
      .withColumn("sales_week", lit(12)))

I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables, I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv"), which would greatly simplify my code.
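
As a plain-Python illustration of that right-anchored idea (a minimal sketch outside Spark, using a hypothetical sample path), the trailing YYYY_WW token can be split off the basename regardless of the prefix:

import os

path = 'my_folder/sales_report_2019_12.csv'
stem = os.path.basename(path)[:-4]                 # drop ".csv" -> "sales_report_2019_12"
sales_year, sales_week = stem.rsplit("_", 2)[-2:]  # last two "_" fields -> "2019", "12"
print(sales_year, sales_week)                      # 2019 12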

Recommended answer

You can easily extract it from the filename using the input_file_name() column and some string functions like regexp_extract and substring_index:

from pyspark.sql.functions import input_file_name, regexp_extract, substring_index, col

df = spark.read.load('my_folder/*.csv', format="csv")

# input_file_name() holds the full source path of each row; group 0 of the
# regex grabs the trailing "YYYY_WW" token, which is then split on "_".
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0))\
       .withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
       .withColumn("sales_week", substring_index(col("year_week"), "_", -1))\
       .drop("year_week")
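
Note that substring_index returns string columns; if integer types are needed downstream, a short follow-up cast (a sketch, reusing the column names from the snippet above) will do it:

from pyspark.sql.functions import col

df = df.withColumn("sales_year", col("sales_year").cast("int")) \
       .withColumn("sales_week", col("sales_week").cast("int"))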
