How to concatenate multiple columns in PySpark with a separator?

Views: 32 | Published: 2021/11/14 22:59:36 | Tags: apache-spark pyspark apache-spark-sql

Question

I have a PySpark DataFrame and I would like to concatenate three of its columns.

id |  column_1   | column_2    | column_3
--------------------------------------------
1  |     12      |   34        |    67
--------------------------------------------
2  |     45      |   78        |    90
--------------------------------------------
3  |     23      |   93        |    56
--------------------------------------------

I want to join the three columns column_1, column_2 and column_3 into a single column, with "-" between their values.

Expected result:

id |  column_1   | column_2    | column_3    |   column_join
-------------------------------------------------------------
1  |     12      |     34      |     67      |   12-34-67
-------------------------------------------------------------
2  |     45      |     78      |     90      |   45-78-90
-------------------------------------------------------------
3  |     23      |     93      |     56      |   23-93-56
-------------------------------------------------------------

How can I do this in PySpark? Thank you.

Recommended answer

It's simple:

from pyspark.sql.functions import col, concat, lit

df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))

Use concat to concatenate the columns. Because concat takes column expressions, the "-" separator has to be wrapped in lit so that it is passed as a literal column.

If this does not work directly, cast the columns to string first, e.g. col("column_1").cast("string").

Update:

Alternatively, for a more dynamic approach, use the built-in function concat_ws:

pyspark.sql.functions.concat_ws(sep, *cols)

Concatenates multiple input string columns together into a single string column, using the given separator.

>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s='abcd-123')]
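
One practical difference between the two functions is how they treat nulls: in Spark SQL, concat returns null if any input is null, whereas concat_ws skips null inputs and joins the rest. The sketch below models that row-level behaviour in plain Python so it can be tried without a Spark session. The names model_concat and model_concat_ws are hypothetical helpers made up for this illustration, not part of PySpark, and the row containing None is invented sample data.

```python
from typing import Optional, Sequence


def model_concat(values: Sequence[Optional[object]]) -> Optional[str]:
    """Plain-Python model of concat: any None input makes the result None."""
    if any(v is None for v in values):
        return None
    return "".join(str(v) for v in values)


def model_concat_ws(sep: str, values: Sequence[Optional[object]]) -> str:
    """Plain-Python model of concat_ws: None inputs are skipped."""
    return sep.join(str(v) for v in values if v is not None)


# Row 1 from the question, plus a hypothetical row with a missing value.
row = (12, 34, 67)
row_with_null = (45, None, 90)

# concat needs the separator passed explicitly between the values.
print(model_concat([row[0], "-", row[1], "-", row[2]]))   # 12-34-67
print(model_concat_ws("-", row))                          # 12-34-67

# With a null in the row, the two approaches diverge.
print(model_concat([row_with_null[0], "-", row_with_null[1], "-", row_with_null[2]]))  # None
print(model_concat_ws("-", row_with_null))                # 45-90
```

If the columns may contain nulls, decide which of these two results you want before choosing between concat and concat_ws.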

Code:

from pyspark.sql.functions import col, concat_ws

concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[col(x) for x in concat_columns]))
