Concat multiple columns of a dataframe using pyspark
Question
Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard-code the column names while concatenating, but need to pick them from the list.
How can I do that?
Recommended Answer
You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list. Just keep passing them as arguments.
from pyspark.sql.functions import concat

# Create an example DataFrame (using the SparkSession, as in the question above)
values = [('A1',11,'A3','A4'),('B1',22,'B3','B4'),('C1',33,'C3','C4')]
df = spark.createDataFrame(values, ['col1','col2','col3','col4'])
df.show()
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A1| 11| A3| A4|
| B1| 22| B3| B4|
| C1| 33| C3| C4|
+----+----+----+----+
In the concat() function, you pass all the columns you need to concatenate - like concat('col1','col2'). If you have a list, you can unpack it using *, so concat(*['col1','col2']) is equivalent to concat('col1','col2').
col_list = ['col1','col2']
df = df.withColumn('concatenated_cols',concat(*col_list))
df.show()
+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
| A1| 11| A3| A4| A111|
| B1| 22| B3| B4| B122|
| C1| 33| C3| C4| C133|
+----+----+----+----+-----------------+
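The * in concat(*col_list) is ordinary Python argument unpacking rather than anything Spark-specific. A minimal pure-Python sketch, using a hypothetical stand-in function (the real concat needs a running SparkSession):

```python
def concat_demo(*cols):
    """Stand-in for pyspark.sql.functions.concat: just echoes its arguments."""
    return cols

col_list = ['col1', 'col2']

# Passing the list with * spreads its elements into separate positional
# arguments, so concat_demo(*col_list) == concat_demo('col1', 'col2').
print(concat_demo(*col_list))   # ('col1', 'col2')
print(concat_demo('col1', 'col2'))  # ('col1', 'col2')
```

One caveat worth knowing: concat() returns NULL for a row if any of its input columns is NULL. If that is a problem, pyspark.sql.functions.concat_ws(sep, *cols) skips NULL values and joins the remaining columns with a separator.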