Pyspark: Select part of the string (file path) column values

Views: 34 | Published: 2021/11/14 23:22:02 | Tags: python dataframe pyspark apache-spark-sql

Problem description

How can I select the part of the file path that follows the 4th backslash (counting from the left) in a column of a Spark DataFrame?

Sample rows of the PySpark column:

\\D\Dev\johnny\Desktop\TEST
\\D\Dev\matt\Desktop\TEST\NEW
\\D\Dev\matt\Desktop\TEST\OLD\TEST
\\E\dev\peter\Desktop\RUN\SUBFOLDER\New
\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New
\\LE345\jskx\rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse

Expected output

johnny\Desktop\TEST
matt\Desktop\TEST\NEW
matt\Desktop\TEST\OLD\TEST
peter\Desktop\RUN\SUBFOLDER\New
ums\Desktop\RUN\SUBFOLDER\New
rfk\Desktop\RUN\SUBFOLDER\New
.
.
.
vso\hsk\mwq\sdsf\kse

My previous question led to this new question. I appreciate any help.

Recommended answer

You can use a regular expression with regexp_replace, for example:

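from pyspark.sql import functions as F

# Remove everything up to and including the 4th backslash:
# the two leading backslashes plus two alphanumeric path segments.
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\", ""))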
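The pattern can be checked without a Spark session. The sketch below is only an illustration: it uses Python's `re` module as a stand-in for the Java regex engine behind `regexp_replace` (an assumption that holds for a simple pattern like this one), and the names `PATTERN`, `samples`, and `strip_prefix` are hypothetical helpers, not part of the original answer. The sample paths are taken from the question.

```python
import re

# Same pattern as in the answer, written as a raw string so each regex
# backslash escape is visible: \\\\ matches two literal backslashes,
# \\ matches one.
PATTERN = r"^\\\\[a-zA-Z0-9]+\\[a-zA-Z0-9]+\\"

# A few sample rows from the question (raw strings keep the backslashes literal).
samples = [
    r"\\D\Dev\johnny\Desktop\TEST",
    r"\\K924\prod\ums\Desktop\RUN\SUBFOLDER\New",
    r"\\ls53\f7sn3\vso\hsk\mwq\sdsf\kse",
]


def strip_prefix(path: str) -> str:
    """Drop everything up to and including the 4th backslash."""
    return re.sub(PATTERN, "", path)


results = [strip_prefix(p) for p in samples]
for original, stripped in zip(samples, results):
    print(original, "->", stripped)
```

Writing the pattern as a raw string yields exactly the same characters as the doubled-backslash literal passed to `regexp_replace` above; it is just easier to read.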

You can also make this solution more flexible, for example:

from pyspark.sql import functions as F

no_of_slashes = 4  # number of backslashes (counted from the left) to strip through

# Build the regular expression by repeating `"[a-zA-Z0-9]+\\\\"`.
# NB: we subtract 2 because the pattern already starts with the first 2 backslashes.
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\" + ("[a-zA-Z0-9]+\\\\" * (no_of_slashes - 2)), ""))
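The pattern-building step can be wrapped in a small function and tested on its own. This is a minimal sketch, not the answerer's code: `build_prefix_pattern` and its `segment` parameter are hypothetical names, Python's `re` stands in for Spark's regex engine, and the `srv_01` path is invented for illustration. The optional `segment` argument exists because `[a-zA-Z0-9]+` will not match path segments containing spaces, underscores, or dots; `[^\\]+` (any run of non-backslash characters) is one possible alternative.

```python
import re


def build_prefix_pattern(no_of_slashes: int, segment: str = r"[a-zA-Z0-9]+") -> str:
    """Return a regex matching a path prefix through the n-th backslash.

    The pattern always starts with the two leading backslashes, so only
    (no_of_slashes - 2) "segment + backslash" groups are appended.
    """
    if no_of_slashes < 2:
        raise ValueError("no_of_slashes must be at least 2")
    return r"^\\\\" + (segment + r"\\") * (no_of_slashes - 2)


# With the answer's character class, on a sample row from the question.
path = r"\\LE345\jskx\rfk\Desktop\RUN\SUBFOLDER\New"
print(re.sub(build_prefix_pattern(4), "", path))
print(re.sub(build_prefix_pattern(5), "", path))

# Hypothetical path with an underscore and a space: the default class does
# not match, so the path is left unchanged; the looser class handles it.
odd_path = r"\\srv_01\my share\alice\docs"
print(re.sub(build_prefix_pattern(4), "", odd_path))
print(re.sub(build_prefix_pattern(4, segment=r"[^\\]+"), "", odd_path))
```

In Spark, the returned string could be passed straight to `F.regexp_replace("path", build_prefix_pattern(4), "")`, since the function produces the same pattern text as the inline expression in the answer.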

Let me know if this works for you.
