PySpark: Add a new column with a tuple created from columns
Question
Here I have a dataframe created as follows:
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
["Id","V1","V2","V3"])
which looks like:
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
| a| 5| R| X|
| b| 7| G| S|
| c| 8| G| S|
+---+---+---+---+
I'm looking to add a column that is a tuple consisting of V1, V2, V3.
The result should look like:
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
| a| 5| R| X|(5,R,X)|
| b| 7| G| S|(7,G,S)|
| c| 8| G| S|(8,G,S)|
+---+---+---+---+-------+
I've tried to use syntax similar to plain Python, but it didn't work:
df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
TypeError: zip argument #1 must support iteration.
Any help would be appreciated!
Answer
I'm coming from Scala, but I believe there's a similar way in Python.
如果要使用这三列获取StructType
,请使用struct(cols: Column*): Column
方法,如下所示:
If you want to get a StructType
with this three column use the struct(cols: Column*): Column
method like this :
from pyspark.sql.functions import struct
df.withColumn("V_tuple",struct(df.V1,df.V2,df.V3))
But if you want to get it as a String, you can use the concat(exprs: Column*): Column function like this:
from pyspark.sql.functions import concat
df.withColumn("V_tuple",concat(df.V1,df.V2,df.V3))
With this second method you may have to cast the columns to Strings first.
I'm not sure about the Python syntax; just edit the answer if there's a syntax error.
Hope this helps. Best regards.