使用pyspark连接数据框的多列 [英] Concat multiple columns of a dataframe using pyspark

查看:124
本文介绍了使用pyspark连接数据框的多列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有一个列列表,例如:

Suppose I have a list of columns, for example:

col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']

我需要通过串联col1col2创建一个新列.我不想在连接时对列名进行硬编码,但需要从列表中选择.

I need to create a new column by concatenating col1 and col2. I don't want to hard code the column names while concatenating but need to pick it from the list.

我该怎么做?

推荐答案

您可以使用 pyspark.sql.functions.concat() concatenate您在list中指定的列数.继续将它们作为参数传递.

You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list. Keep on passing them as arguments.

from pyspark.sql.functions import concat
# Creating an example DataFrame
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  A1|  11|  A3|  A4|
|  B1|  22|  B3|  B4|
|  C1|  33|  C3|  C4|
+----+----+----+----+

concat()函数中,传递需要连接的所有列-如concat('col1','col2').如果有列表,则可以使用* un-list对其进行列表.所以(*['col1','col2'])返回('col1','col2')

In the concat() function, you pass all the columns you need to concatenate - like concat('col1','col2'). If you have a list, you can un-list it using *. So (*['col1','col2']) returns ('col1','col2')

col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
|  A1|  11|  A3|  A4|             A111|
|  B1|  22|  B3|  B4|             B122|
|  C1|  33|  C3|  C4|             C133|
+----+----+----+----+-----------------+

这篇关于使用pyspark连接数据框的多列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆