Extract values from spark dataframe column into new derived column
Problem description
I have the following dataframe schema:
root
|-- SOURCE: string (nullable = true)
|-- SYSTEM_NAME: string (nullable = true)
|-- BUCKET_NAME: string (nullable = true)
|-- LOCATION: string (nullable = true)
|-- FILE_NAME: string (nullable = true)
|-- LAST_MOD_DATE: string (nullable = true)
|-- FILE_SIZE: string (nullable = true)
I would like to derive a new column after extracting values from certain columns. The data in the LOCATION column looks like the following:
example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx
Question 1: I would like to derive a new column called "folder_num" and extract the following:
1. The 2 characters followed by 6 digits between the slashes. The output is "AA160039" from example 1. This expression or mask will not change: it is always 2 characters followed by 6 digits.
2. Digits, but only if they are between slashes. The output is "355" from example 2. The number could be a single digit such as "8", double digits such as "55", triple digits such as "444", up to 5 digits such as "12345". As long as they are between slashes, they need to be extracted into the new column.
How can I achieve this in Spark? I'm new to this technology, so your help is much appreciated.
df1 = df0.withColumn("LOCATION", trim(col('LOCATION')))
# pseudocode for the two conditions:
# if LOCATION like '%/[A-Z]{2}[0-9]{6}/%' -- extract value and add to new derived column
# if LOCATION like '%/[0-9]{1,5}/%'       -- extract value and add to new derived column
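The two conditions above map directly onto capturing regexes. As a quick sanity check outside Spark, the same patterns can be exercised with Python's `re` module against the two example paths from the question (this is a sketch of the patterns only, not the Spark code itself):

```python
import re

# Pattern 1: two uppercase letters followed by six digits, between slashes
folder_id = re.compile(r"/([A-Z]{2}[0-9]{6})/")
# Pattern 2: 1 to 5 digits between slashes
digits_only = re.compile(r"/([0-9]{1,5})/")

path1 = "prod/docs/Folder1/AA160039/Folder2/XXX.pdf"
path2 = "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"

print(folder_id.search(path1).group(1))    # AA160039
print(digits_only.search(path2).group(1))  # 355
```

The same pattern strings can be passed to Spark's `regexp_extract`, which returns the captured group or an empty string when there is no match.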
Thank you for your help.
Added code:
from pyspark.sql.functions import col, lit, trim, when, regexp_extract

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col("LOCATION")))\
    .withColumn("FOLDER_NUM", when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
                                   regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                              .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1)))
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
Recommended answer
The information provided was really helpful. I appreciate everyone for putting me on the right track. The final version of the code is below.
from pyspark.sql.functions import col, lit, trim, when, regexp_extract

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col("LOCATION")))\
    .withColumn("FOLDER_NUM",
                when(regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
                     regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1))
                .when(regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""),
                      regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                .when(regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1) != lit(""),
                      regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1))
                .otherwise("Unknown"))
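The three `when` branches evaluate in order: the letter-digit mask in FILE_NAME first, then the letter-digit folder in LOCATION, then a digits-only folder, and finally the `otherwise` fallback. That precedence can be mimicked in plain Python with `re` to see which branch fires for a given row (the sample values below are illustrative, not taken from the question's data):

```python
import re

def folder_num(file_name, location):
    """Mimic the when/when/when/otherwise chain from the Spark expression,
    trying each (text, pattern) pair in order and returning the first match."""
    for text, pattern in [
        (file_name.strip(), r"([A-Z]{2}[0-9]{6})"),   # branch 1: FILE_NAME
        (location.strip(), r"/([A-Z]{2}[0-9]{6})/"),  # branch 2: LOCATION, letters+digits
        (location.strip(), r"/([0-9]{1,5})/"),        # branch 3: LOCATION, digits only
    ]:
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return "Unknown"  # the otherwise(...) fallback

print(folder_num("AA120068_Letter.pdf", "prod/docs/x"))                         # AA120068
print(folder_num("zzz.docx", "prod/docs/Folder1/FolderX/Folder3/355/zzz.docx")) # 355
print(folder_num("readme.txt", "prod/docs"))                                    # Unknown
```

Because each branch only fires when its `regexp_extract` returns a non-empty string, a row that matches none of the three patterns ends up as "Unknown" rather than an empty value.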
Thanks.