Using filenames to create variables - PySpark
Question
I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit

df = (spark.read.load('my_folder/sales_report_2019_12.csv', format="csv")
      .withColumn("sales_year", lit(2019))
      .withColumn("sales_week", lit(12)))
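The "count from the right" idea works in plain Python too, since the suffix YYYY_WW.csv has a fixed shape regardless of the prefix. A minimal sketch (the helper name year_week_from_name is hypothetical, not part of the original post):

```python
def year_week_from_name(filename):
    # The suffix is always YYYY_WW.csv: strip ".csv", then split on "_"
    # counting from the right so the variable-length prefix is ignored.
    stem = filename[:-len(".csv")]
    parts = stem.rsplit("_", 2)  # e.g. ['sales_report', '2019', '12']
    return int(parts[-2]), int(parts[-1])

year, week = year_week_from_name("sales_report_2019_12.csv")
print(year, week)  # 2019 12
```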
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables, I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv"), which would greatly simplify my code.
Answer
You can easily extract it from the filename using the input_file_name() column and some string functions like regexp_extract and substring_index:
from pyspark.sql.functions import input_file_name, regexp_extract, substring_index, col

df = spark.read.load('my_folder/*.csv', format="csv")
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0)) \
       .withColumn("sales_year", substring_index(col("year_week"), "_", 1)) \
       .withColumn("sales_week", substring_index(col("year_week"), "_", -1)) \
       .drop("year_week")
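The extraction logic can be sanity-checked outside Spark with plain Python, since regexp_extract and substring_index behave like an ordinary regex search followed by a split. A minimal sketch (the helper name parse_year_week and the sample paths are illustrative, not from the original answer):

```python
import re

def parse_year_week(path):
    # Mimic regexp_extract: grab the first YYYY_WW match in the path.
    match = re.search(r"\d{4}_\d{1,2}", path)
    if match is None:
        return None, None  # regexp_extract would return '' for no match
    year_week = match.group(0)
    # Mimic substring_index(col, "_", 1) and substring_index(col, "_", -1).
    year, week = year_week.split("_", 1)
    return year, week

print(parse_year_week("my_folder/sales_report_2019_12.csv"))  # ('2019', '12')
print(parse_year_week("my_folder/cash_flow_2020_5.csv"))      # ('2020', '5')
```

Note that substring_index returns strings, so cast the resulting columns to integers (e.g. with .cast("int")) if you need numeric sales_year and sales_week values.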