Pyspark: Select part of the string (file path) column values
Question
How can I select the characters (the file path) after the 4th backslash, counting from the left, in a Spark DataFrame column?
Sample rows of the pyspark column:
\\D\Dev\johnny\Desktop\TEST
\\D\Dev\matt\Desktop\TEST\NEW
\\D\Dev\matt\Desktop\TEST\OLD\TEST
\\E\dev\peter\Desktop\RUN\SUBFOLDER\New
\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New
\\LE345\jskx\rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse
Expected output
johnny\Desktop\TEST
matt\Desktop\TEST\NEW
matt\Desktop\TEST\OLD\TEST
peter\Desktop\RUN\SUBFOLDER\New
ums\Desktop\RUN\SUBFOLDER\New
rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
vso\hsk\mwq\sdsf\kse
My previous question led to this new one. I'd appreciate any help.
Recommended answer
You can use a regular expression in regexp_replace, e.g.:
from pyspark.sql import functions as F
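# Remove the leading "\\<name>\<name>\" prefix, i.e. everything up to and including the 4th backslash
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\", ""))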
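To see what this pattern does without starting a Spark session, the same regular expression can be tried with Python's built-in re module. This is only a sketch: it assumes that Spark's regex engine and Python's re treat this particular pattern the same way, which should hold because it uses only anchors, character classes and escaped backslashes. The names PATTERN, samples and stripped are made up for illustration.

```python
import re

# Same pattern as in the answer, written as a raw string so that each
# literal backslash needs only one level of escaping.
PATTERN = r"^\\\\[a-zA-Z0-9]+\\[a-zA-Z0-9]+\\"

# A few of the sample rows from the question.
samples = [
    r"\\D\Dev\johnny\Desktop\TEST",
    r"\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New",
    r"\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse",
]

# Remove the matched prefix, as regexp_replace does with an empty replacement.
stripped = [re.sub(PATTERN, "", s) for s in samples]

for before, after in zip(samples, stripped):
    print(before, "->", after)
```

Rows whose first two path components contain anything other than letters and digits (a hyphen or an underscore, say) do not match the pattern and are left unchanged.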
You can also make this solution more flexible, e.g.:
from pyspark.sql import functions as F
no_of_slashes = 4  # number of backslashes to strip, counting from the left
# We build the regular expression by repeating `"[a-zA-Z0-9]+\\\\"`
# NB. We subtract 2 because the pattern already starts with the first 2 backslashes
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\" + ("[a-zA-Z0-9]+\\\\" * (no_of_slashes - 2)), ""))
Let me know if this works for you.
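As a possible variation on the flexible version, the pattern construction can be wrapped in a small function and checked in plain Python first. The helper name build_prefix_pattern is hypothetical, and the sketch swaps the character class [a-zA-Z0-9] for [^\\] (any character except a backslash), on the assumption that server or folder names in the real data may contain hyphens, underscores or dots.

```python
import re


def build_prefix_pattern(no_of_slashes: int) -> str:
    """Hypothetical helper: regex for everything up to and including the
    n-th backslash of a path that starts with two backslashes."""
    if no_of_slashes < 2:
        raise ValueError("no_of_slashes must be at least 2")
    # The pattern always starts with the two leading backslashes; each extra
    # backslash adds one "run of non-backslash characters + backslash" group.
    return r"^\\\\" + r"[^\\]+\\" * (no_of_slashes - 2)


pattern = build_prefix_pattern(4)
# A made-up path with a hyphen and an underscore in the first two components.
result = re.sub(pattern, "", r"\\my-server\dev_1\matt\Desktop\TEST\NEW")
print(result)
```

The returned string could then be passed as the pattern argument of F.regexp_replace in place of the hand-built expression. That step is not run here, so treat it as an untested assumption.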