How to concatenate multiple columns in PySpark with a separator?
Question
I have a PySpark DataFrame and I would like to join 3 columns.
id | column_1 | column_2 | column_3
--------------------------------------------
1 | 12 | 34 | 67
--------------------------------------------
2 | 45 | 78 | 90
--------------------------------------------
3 | 23 | 93 | 56
--------------------------------------------
I want to join the 3 columns column_1, column_2, and column_3 into a single column, with "-" between their values.
Expected result:
id | column_1 | column_2 | column_3 | column_join
-------------------------------------------------------------
1 | 12 | 34 | 67 | 12-34-67
-------------------------------------------------------------
2 | 45 | 78 | 90 | 45-78-90
-------------------------------------------------------------
3 | 23 | 93 | 56 | 23-93-56
-------------------------------------------------------------
How can I do it in PySpark? Thank you.
Answer
It's simple:
from pyspark.sql.functions import col, concat, lit
df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))
Use concat to concatenate all the columns with the "-" separator, for which you will need to use lit.
If it doesn't work directly, you can use cast to change the column types to string: col("column_1").cast("string")
Update:
Or you can use a more dynamic approach with the built-in function concat_ws:
pyspark.sql.functions.concat_ws(sep, *cols)
Concatenates multiple input string columns together into a single string column, using the given separator.
>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]
Code:
from pyspark.sql.functions import col, concat_ws
concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))