Create a tuple out of two columns - PySpark


Problem Description

My problem is based on the similar question PySpark: Add a new column with a tuple created from columns, with the difference that I have a list of values instead of a single value per column. For example:

from pyspark.sql import Row

# sqlContext is the SQLContext available in a Spark 1.6 shell/session
df = sqlContext.createDataFrame([
    Row(v1=[u'2.0', u'1.0', u'9.0'], v2=[u'9.0', u'7.0', u'2.0']),
    Row(v1=[u'4.0', u'8.0', u'9.0'], v2=[u'1.0', u'1.0', u'2.0'])])

    +---------------+---------------+
    |             v1|             v2|
    +---------------+---------------+
    |[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|
    |[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|
    +---------------+---------------+

What I am trying to get is something like an element-wise zip of the two lists in each row, but I can't figure out how to do it in PySpark 1.6:

+---------------+---------------+--------------------+
|             v1|             v2|             v_tuple|
+---------------+---------------+--------------------+
|[2.0, 1.0, 9.0]|[9.0, 7.0, 2.0]|[(2.0,9.0), (1.0,...|
|[4.0, 8.0, 9.0]|[1.0, 1.0, 2.0]|[(4.0,1.0), (8.0,...|
+---------------+---------------+--------------------+

Note: The size of the arrays may vary from row to row, but within a single row the two columns always have the same size.
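In plain Python the per-row operation is just the built-in zip; a minimal sketch using the first sample row above:

v1 = [2.0, 1.0, 9.0]
v2 = [9.0, 7.0, 2.0]
# Element-wise pairing of the two lists:
print(list(zip(v1, v2)))  # [(2.0, 9.0), (1.0, 7.0), (9.0, 2.0)]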

Recommended Answer

If the size of the arrays varies from row to row, you'll need a UDF:

from pyspark.sql.functions import udf

# Spark 2.x: declare the return type with a DDL string; each output
# element is a struct with double fields _1 and _2.
@udf("array<struct<_1:double,_2:double>>")
def zip_(xs, ys):
    return list(zip(xs, ys))

df.withColumn("v_tuple", zip_("v1", "v2"))
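Note that the sample DataFrame above holds the numbers as strings (u'2.0'), while the declared return type uses double fields, so the values need converting somewhere. A minimal sketch that converts inside the UDF, assuming every element parses as a float (zip_float_ is a name introduced here):

from pyspark.sql.functions import udf

@udf("array<struct<_1:double,_2:double>>")
def zip_float_(xs, ys):
    # The sample data stores the numbers as strings, so convert
    # before zipping to match the declared double fields.
    return list(zip([float(x) for x in xs], [float(y) for y in ys]))

df.withColumn("v_tuple", zip_float_("v1", "v2")).show(truncate=False)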

In Spark 1.6:

from pyspark.sql.functions import udf
from pyspark.sql.types import *

# Spark 1.6 has no decorator or DDL-string support, so spell the
# return type out with StructType/StructField.
zip_ = udf(
    lambda xs, ys: list(zip(xs, ys)),
    ArrayType(StructType([StructField("_1", DoubleType()), StructField("_2", DoubleType())])))
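Usage is the same as above: df.withColumn("v_tuple", zip_("v1", "v2")). As a quick sanity check, individual pairs can be pulled back out of the array-of-structs column with getItem/getField (a usage sketch; first_v1 and first_v2 are column aliases introduced here):

from pyspark.sql.functions import col

result = df.withColumn("v_tuple", zip_("v1", "v2"))
# Extract the first pair's components from the array-of-structs column:
result.select(
    col("v_tuple").getItem(0).getField("_1").alias("first_v1"),
    col("v_tuple").getItem(0).getField("_2").alias("first_v2")).show()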
