Extract values from a Spark DataFrame column into a new derived column


Problem description

I have the following DataFrame schema:

        root
         |-- SOURCE: string (nullable = true)
         |-- SYSTEM_NAME: string (nullable = true)
         |-- BUCKET_NAME: string (nullable = true)
         |-- LOCATION: string (nullable = true)
         |-- FILE_NAME: string (nullable = true)
         |-- LAST_MOD_DATE: string (nullable = true)
         |-- FILE_SIZE: string (nullable = true)

I would like to derive a column by extracting data values from certain columns. The data in the LOCATION column looks like the following:

example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx

Question 1: I would like to derive a new column called "folder_num" by extracting the following from LOCATION:

1. Two characters followed by six digits, between slashes. The output is "AA160039" for example 1. This mask never changes: it is always two characters followed by six digits.
2. Digits only, when they sit between slashes. The output is "355" for example 2. The number can be a single digit such as "8", two digits "55", three digits "444", and up to five digits "12345". As long as the digits are between slashes, they need to be extracted into the new column.
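
The two rules above can be tried outside Spark with Python's `re` module. This is a minimal sketch, assuming that the "2 characters" are uppercase letters (as in the `[A-Z]{2}` pattern used later in the question) and that rule 1 takes priority over rule 2. The helper name `extract_folder_num` is hypothetical and not part of any Spark API.

```python
import re

# Rule 1: two uppercase letters followed by six digits, between slashes.
CODE_BETWEEN_SLASHES = re.compile(r"/([A-Z]{2}[0-9]{6})/")
# Rule 2: one to five digits, between slashes.
DIGITS_BETWEEN_SLASHES = re.compile(r"/([0-9]{1,5})/")


def extract_folder_num(location):
    """Return the first rule-1 match, else the first rule-2 match, else None."""
    cleaned = location.strip()
    for pattern in (CODE_BETWEEN_SLASHES, DIGITS_BETWEEN_SLASHES):
        match = pattern.search(cleaned)
        if match:
            return match.group(1)
    return None


print(extract_folder_num("prod/docs/Folder1/AA160039/Folder2/XXX.pdf"))
print(extract_folder_num("prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"))
```

With the two example paths from the question, the calls print `AA160039` and `355`. Because both patterns require a slash on each side, a run of six or more digits between slashes is not picked up by rule 2.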

How can I achieve this in Spark? I'm new to this technology, so your help is much appreciated.

from pyspark.sql.functions import col, trim

df1 = df0.withColumn("LOCATION", trim(col("LOCATION")))
# Intended logic (pseudocode):
# if LOCATION matches /[A-Z]{2}[0-9]{6}/ -> extract the value into a new derived column
# if LOCATION matches /[0-9]{1,5}/       -> extract the value into a new derived column

Thanks for your help.

Added code:

df1 = df0.withColumn("LAST_MOD_DATE",(col("LAST_MOD_DATE").cast("timestamp")))\
                         .withColumn("FILE_SIZE",(col("FILE_SIZE").cast("integer")))\
                         .withColumn("LOCATION", trim(col('LOCATION')))\
                         .withColumn("FOLDER_NUM", when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""), 
                                                     regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                                                .otherwise(regexp_extract(col("LOCATION"),".*/([0-9]{1,5})/.*" , 1)))



+------+-----------+------------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME| BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+------------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|     bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|     bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|     bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |

Recommended answer

The information provided was really helpful. I appreciate everyone putting me on the right track. The final version of the code is below.

from pyspark.sql.functions import col, lit, regexp_extract, trim, when

# regexp_extract returns "" when nothing matches, so each branch tests for a non-empty result.
code_in_file = regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1)
code_in_path = regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1)
digits_in_path = regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1)

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
         .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
         .withColumn("LOCATION", trim(col("LOCATION")))\
         .withColumn("FOLDER_NUM", when(code_in_file != lit(""), code_in_file)
                                   .when(code_in_path != lit(""), code_in_path)
                                   .when(digits_in_path != lit(""), digits_in_path)
                                   .otherwise("Unknown"))
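
To sanity-check the priority order of the `when` chain without a Spark session, the same three patterns can be mirrored in plain Python. This is only a sketch: `extract_or_empty` and `derive_folder_num` are hypothetical helpers, and it assumes Python's `re.search` behaves like Spark's `regexp_extract` for these simple patterns. Note that Python's `strip()` removes all surrounding whitespace, which is slightly broader than Spark's `trim`.

```python
import re


def extract_or_empty(value, pattern, group):
    """Return the requested capture group, or "" when the pattern does not match."""
    match = re.search(pattern, value)
    return match.group(group) if match else ""


def derive_folder_num(file_name, location):
    """Mirror the when() chain: file name first, then the two LOCATION patterns."""
    candidates = (
        extract_or_empty(file_name.strip(), r"([A-Z]{2}[0-9]{6}).*", 1),
        extract_or_empty(location.strip(), r".*/([A-Z]{2}[0-9]{6})/.*", 1),
        extract_or_empty(location.strip(), r".*/([0-9]{1,5})/.*", 1),
    )
    # The first non-empty candidate wins, matching the order of the when() branches.
    return next((c for c in candidates if c), "Unknown")


print(derive_folder_num("AA120068_Letter.pdf", "prod/docs/Folder1/AA160039/Folder2/XXX.pdf"))
print(derive_folder_num("zzz.docx", "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"))
```

The first call prints `AA120068` because the file-name branch is checked before either LOCATION branch; the second prints `355`. A row that matches none of the three patterns falls through to `Unknown`.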

Thanks.
