Difference of elements in list in PySpark


Question

I have a PySpark dataframe (df) with a column that contains two-element lists. The two elements in each list are not sorted in ascending or descending order.

+--------+----------+-------+
| version| timestamp| list  |
+--------+----------+-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2,3] |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [2,1] |
+--------+----------+-------+

I want to take the difference between the first and the second element of each list and store it in another column (diff). Here is an example of the output that I want.

+--------+----------+-------+-------+
| version| timestamp| list  |  diff |
+--------+----------+-------+-------+
| v1     |2012-01-10| [5,2] |   3   |
| v1     |2012-01-11| [2,5] |  -3   |
| v1     |2012-01-12| [3,2] |   1   |
| v2     |2012-01-12| [2,3] |  -1   |
| v2     |2012-01-11| [1,2] |  -1   |
| v2     |2012-01-13| [2,1] |   1   |
+--------+----------+-------+-------+

How can I do this with PySpark?

I tried the following:

transform_expr = (
        "transform(diff, x-y ->"
        + "x as list[0], y as list[1])"
    )

df = df.withColumn("diff", F.expr(transform_expr)) 

However, the above approach did not give me any output.

I am also open to using UDFs to get my intended output, if that is needed.

Approaches with and without UDFs are both welcome. Thanks.

Recommended answer

There are multiple ways to do this: you can use any of element_at (Spark 2.4 or newer), transform, array indexing ([0]), or .getItem() to get the difference.

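from pyspark.sql.functions import col, concat_ws, element_at, expr

# sample dataframe (assumes an active SparkSession named `spark`)
df = spark.createDataFrame([([5, 2],), ([2, 5],)], ["list"])

# using element_at (positions are 1-based)
df.withColumn("diff", element_at(col("list"), 1) - element_at(col("list"), 2)).show()

# using transform: wrap the list in an outer array, map it to [x[0] - x[1]],
# then concat_ws flattens the one-element result (diff becomes a string column)
df.withColumn("diff", concat_ws("", expr("""transform(array(list), x -> x[0] - x[1])"""))).show()

# using array index (0-based)
df.withColumn("diff", col("list")[0] - col("list")[1]).show()

# using .getItem (0-based)
df.withColumn("diff", col("list").getItem(0) - col("list").getItem(1)).show()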

#+------+----+
#|  list|diff|
#+------+----+
#|[5, 2]|   3|
#|[2, 5]|  -3|
#+------+----+
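
Since the question also asks about UDFs, here is a minimal sketch of a UDF-based version. The helper names list_diff and add_diff_column are made up for illustration, and the LongType return type assumes the list holds integers (which is what Spark infers for the sample dataframe above). The built-in column expressions shown earlier are generally preferable, because a Python UDF has to ship every row to a Python worker and back.

```python
def list_diff(values):
    """Return first element minus second, or None if the input is unusable."""
    if values is None or len(values) < 2:
        return None
    first, second = values[0], values[1]
    if first is None or second is None:
        return None
    return first - second


def add_diff_column(df, list_col="list", out_col="diff"):
    """Add `out_col` to `df` by applying list_diff as a Python UDF."""
    # Imported lazily so list_diff can be used and tested without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import LongType

    diff_udf = udf(list_diff, LongType())
    return df.withColumn(out_col, diff_udf(col(list_col)))
```

With the sample dataframe from the answer, add_diff_column(df).show() should add the same diff column as the built-in approaches; rows with a null or too-short list get null instead of raising an error.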
