PySpark: Add a new column with a tuple created from columns
Question
Here I have a dataframe created as follows:
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
["Id","V1","V2","V3"])
which looks like
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
| a| 5| R| X|
| b| 7| G| S|
| c| 8| G| S|
+---+---+---+---+
I'm looking to add a column that is a tuple consisting of V1, V2, and V3.
The result should look like this:
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
| a| 5| R| X|(5,R,X)|
| b| 7| G| S|(7,G,S)|
| c| 8| G| S|(8,G,S)|
+---+---+---+---+-------+
I've tried using syntax similar to plain Python, but it didn't work:
df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
TypeError: zip argument #1 must support iteration.
Any help would be appreciated!
Answer
I'm coming from Scala, but I believe there's a similar way in Python, using the sql.functions package.
If you want to get a StructType with these three columns, use the struct(cols: Column*): Column method like this:
from pyspark.sql.functions import struct
df.withColumn("V_tuple",struct(df.V1,df.V2,df.V3))
But if you want to get it as a String, you can use the concat(exprs: Column*): Column method like this:
from pyspark.sql.functions import concat
df.withColumn("V_tuple",concat(df.V1,df.V2,df.V3))
With this second method, you may have to cast the columns to Strings.
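As a sketch of that casting (and, if you want the "(5,R,X)" look shown in the question, the standard pyspark.sql.functions helpers concat_ws and lit, which are my addition rather than part of the original answer), assuming the df above:

from pyspark.sql.functions import concat, concat_ws, lit

# cast the integer column to string before concatenating, e.g. "5RX"
df_str = df.withColumn("V_tuple", concat(df.V1.cast("string"), df.V2, df.V3))

# add separators and parentheses to match the "(5,R,X)" formatting from the question
df_fmt = df.withColumn(
    "V_tuple",
    concat(lit("("), concat_ws(",", df.V1.cast("string"), df.V2, df.V3), lit(")"))
)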
I'm not sure about the Python syntax; just edit the answer if there's a syntax error.
Hope this helps. Best regards.