Split one column based on the value of another column in PySpark
Question
I have the following DataFrame:
+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+
For each row, I want to find the relative path of the value in column "item" by locating that value in column "path" and extracting the left-hand side of the path, up to and including the item, as shown below:
+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|  a/b/c|       a|
|   b|  e/b/f|     e/b|
|   d|e/b/d/h|   e/b/d|
|   c|  g/h/c|   g/h/c|
+----+-------+--------+
I tried the functions split(str, pattern) and regexp_extract(str, pattern, idx), but I am not sure how to pass the value of column "item" into their pattern argument. Any idea how this could be done without writing a function?
Recommended answer
You can use pyspark.sql.functions.expr to pass a column value as a parameter to regexp_replace. Here you need to wrap item in a positive lookbehind, (?<=item), and concatenate it with .+ so that the pattern matches everything after the item; that match is then replaced with an empty string.
from pyspark.sql.functions import expr

# Build the pattern per row: '(?<=' + item + ').+' matches everything
# that follows the first occurrence of item in path.
df.withColumn(
    "rel_path",
    expr("regexp_replace(path, concat('(?<=',item,').+'), '')")
).show()
#+----+-------+--------+
#|item|   path|rel_path|
#+----+-------+--------+
#|   a|  a/b/c|       a|
#|   b|  e/b/f|     e/b|
#|   d|e/b/d/h|   e/b/d|
#|   c|  g/h/c|   g/h/c|
#+----+-------+--------+
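To see why the lookbehind pattern produces these values, the same regular expression can be tried row by row in plain Python. This is only an illustrative sketch, not part of the original answer. It assumes that Python's re module and the Java regex engine used by Spark treat this simple lookbehind the same way. The helper name rel_path is hypothetical, and the re.escape call is an addition that the Spark expression above does not perform.

```python
import re


def rel_path(item: str, path: str) -> str:
    """Keep `path` up to and including the first occurrence of `item`."""
    # (?<=item) asserts that `item` ends immediately before the match position;
    # .+ then consumes the rest of the string, which is replaced with "".
    # re.escape (an addition, not in the original answer) makes regex
    # metacharacters in `item` match literally.
    pattern = "(?<=" + re.escape(item) + ").+"
    return re.sub(pattern, "", path)


# Same rows as the example DataFrame.
rows = [("a", "a/b/c"), ("b", "e/b/f"), ("d", "e/b/d/h"), ("c", "g/h/c")]
results = [rel_path(item, path) for item, path in rows]

for (item, path), result in zip(rows, results):
    print(item, path, result)
```

In the last row nothing follows c, so .+ finds no match and the path is returned unchanged. Note that the pattern matches item as a substring rather than as a whole path segment, and that the Spark expression does not escape item, so values containing regex metacharacters would be interpreted as pattern syntax there.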