Using filenames to create variables - PySpark
Question
I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit

df = (spark.read.load('my_folder/sales_report_2019_12.csv', format="csv")
      .withColumn("sales_year", lit(2019))
      .withColumn("sales_week", lit(12)))
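The "count from the right" idea works in plain Python too, since the suffix YYYY_WW.csv has a fixed shape regardless of the prefix. A minimal sketch (the helper name year_week_from_name is hypothetical, not part of the original post):

```python
def year_week_from_name(filename):
    # The suffix is always YYYY_WW.csv: strip ".csv", then split on "_"
    # counting from the right so the variable-length prefix is ignored.
    stem = filename[:-len(".csv")]
    parts = stem.rsplit("_", 2)  # e.g. ['sales_report', '2019', '12']
    return int(parts[-2]), int(parts[-1])

year, week = year_week_from_name("sales_report_2019_12.csv")
print(year, week)  # 2019 12
```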
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables, I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv"), which would greatly simplify my code.
Answer
You can easily extract it from the filename using the input_file_name() column and some string functions like regexp_extract and substring_index:
from pyspark.sql.functions import input_file_name, regexp_extract, substring_index, col

df = spark.read.load('my_folder/*.csv', format="csv")
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0)) \
       .withColumn("sales_year", substring_index(col("year_week"), "_", 1)) \
       .withColumn("sales_week", substring_index(col("year_week"), "_", -1)) \
       .drop("year_week")
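The extraction logic can be sanity-checked outside Spark with plain Python, since regexp_extract and substring_index behave like an ordinary regex search followed by a split. A minimal sketch (the helper name parse_year_week and the sample paths are illustrative, not from the original answer):

```python
import re

def parse_year_week(path):
    # Mimic regexp_extract: grab the first YYYY_WW match in the path.
    match = re.search(r"\d{4}_\d{1,2}", path)
    if match is None:
        return None, None  # regexp_extract would return '' for no match
    year_week = match.group(0)
    # Mimic substring_index(col, "_", 1) and substring_index(col, "_", -1).
    year, week = year_week.split("_", 1)
    return year, week

print(parse_year_week("my_folder/sales_report_2019_12.csv"))  # ('2019', '12')
print(parse_year_week("my_folder/cash_flow_2020_5.csv"))      # ('2020', '5')
```

Note that substring_index returns strings, so cast the resulting columns to integers (e.g. with .cast("int")) if you need numeric sales_year and sales_week values.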