Pyspark: Select part of the string (file path) column values


Problem description

Pyspark: Split and select part of the string column values

How can I select the characters or file path after the 4th (from left) backslash from a column in a Spark DataFrame?

Sample rows of the pyspark column:

\\D\Dev\johnny\Desktop\TEST
\\D\Dev\matt\Desktop\TEST\NEW
\\D\Dev\matt\Desktop\TEST\OLD\TEST
\\E\dev\peter\Desktop\RUN\SUBFOLDER\New
\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New
\\LE345\jskx\rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse

Expected output

johnny\Desktop\TEST
matt\Desktop\TEST\NEW
matt\Desktop\TEST\OLD\TEST
peter\Desktop\RUN\SUBFOLDER\New
ums\Desktop\RUN\SUBFOLDER\New
rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
vso\hsk\mwq\sdsf\kse

My previous question led to this new question. I would appreciate any help.

Recommended answer

You can use a regular expression in regexp_replace, e.g.:

from pyspark.sql import functions as F

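# Strip the leading prefix (two backslashes, then two alphanumeric segments each
# followed by a backslash), i.e. everything up to and including the 4th backslash.
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\", ""))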
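
The pattern can be checked locally before running it on a cluster. The following is a minimal sketch that uses Python's re module rather than Spark, on the assumption that this simple pattern behaves the same in Python and in the Java regex engine Spark uses. The names PREFIX_PATTERN, strip_prefix and samples are hypothetical and only serve the illustration.

```python
import re

# Same pattern as in the answer, written as a raw string for readability.
PREFIX_PATTERN = r"^\\\\[a-zA-Z0-9]+\\[a-zA-Z0-9]+\\"


def strip_prefix(path):
    """Remove everything up to and including the 4th backslash, if the prefix matches."""
    return re.sub(PREFIX_PATTERN, "", path)


# A few sample rows from the question (raw strings keep the backslashes literal).
samples = [
    r"\\D\Dev\johnny\Desktop\TEST",
    r"\\D\Dev\matt\Desktop\TEST\NEW",
    r"\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New",
    r"\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse",
]

for path in samples:
    print(path, "->", strip_prefix(path))
```

A path that does not start with this exact prefix shape is returned unchanged, because the pattern is anchored at the start of the string and nothing is replaced when it fails to match.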

You can also make this solution more flexible, e.g.:

from pyspark.sql import functions as F
no_of_slashes = 4  # number of slashes to consider here

# We build the regular expression by repeating `"[a-zA-Z0-9]+\\\\"`.
# NB: we subtract 2 because the pattern already starts with the first 2 slashes.
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\" + ("[a-zA-Z0-9]+\\\\" * (no_of_slashes - 2)), ""))
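
The pattern construction can be wrapped in a small helper so that it is easy to test in isolation. This is a sketch under stated assumptions, not part of the original answer: build_prefix_pattern and its segment parameter are hypothetical names, and the check again uses Python's re module instead of Spark. The optional segment argument illustrates one possible variation, matching any run of non-backslash characters, which may help when folder or host names contain characters such as hyphens that `[a-zA-Z0-9]+` does not cover.

```python
import re


def build_prefix_pattern(no_of_slashes, segment=r"[a-zA-Z0-9]+"):
    """Build a regex matching the path prefix up to the given number of backslashes.

    The first two backslashes are the leading pair, so the segment pattern
    (followed by one backslash) is repeated no_of_slashes - 2 times.
    """
    if no_of_slashes < 2:
        raise ValueError("no_of_slashes must be at least 2")
    return r"^\\\\" + (segment + r"\\") * (no_of_slashes - 2)


# Default segment: same pattern as the answer for 4 slashes.
pattern = build_prefix_pattern(4)
print(re.sub(pattern, "", r"\\E\dev\peter\Desktop\RUN\SUBFOLDER\New"))

# Hypothetical variation: allow any character except a backslash in a segment.
loose_pattern = build_prefix_pattern(4, segment=r"[^\\]+")
print(re.sub(loose_pattern, "", r"\\my-host\dev\x\y"))
```

The returned string is an ordinary regular expression, so it can be passed as the pattern argument of F.regexp_replace in place of the inline expression.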

Let me know if this works for you.
