PySpark - compare single list of integers to column of lists


Question

I'm trying to check which entries in a Spark dataframe (a column of lists) contain the largest number of values from a given list.

The best approach I've come up with is iterating over the dataframe with rdd.foreach() and comparing the given list to every entry using Python's set1.intersection(set2).

My question is: does Spark have any built-in functionality for this, so that iterating with .foreach() can be avoided?

Thanks for your help!

P.S. My dataframe looks like this:

+-------------+---------------------+                                           
|   cardnumber|collect_list(article)|
+-------------+---------------------+
|2310000000855| [12480, 49627, 80...|
|2310000008455| [35531, 22564, 15...|
|2310000011462| [117112, 156087, ...|
+-------------+---------------------+

I'm trying to find the entries with the most intersections between the second column and a given list of articles, e.g. [151574, 87239, 117908, 162475, 48599].

Answer

The only alternative here is a udf, but it won't make much of a difference.

from pyspark.sql.functions import udf

def intersect(xs):
    xs = set(xs)  # fixed reference set, captured by the udf below
    @udf("array<long>")
    def _(ys):
        # intersect each row's list with the reference set
        return list(xs.intersection(ys))
    return _

It can be applied as follows:

a_list = [1, 4, 6]

df = spark.createDataFrame([
    (1, [3, 4, 8]), (2, [7, 2, 6])
], ("id", "articles"))

df.withColumn("intersect", intersect(a_list)("articles")).show()

# +---+---------+---------+
# | id| articles|intersect|
# +---+---------+---------+
# |  1|[3, 4, 8]|      [4]|
# |  2|[7, 2, 6]|      [6]|
# +---+---------+---------+
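
Since the goal is the entries with the largest overlap, one way to rank them (a sketch building on the udf above; size and desc are standard pyspark.sql.functions helpers, and n_matches is just an illustrative column name) is to sort by the size of the intersection:

from pyspark.sql.functions import desc, size

# add the intersection, count its elements, and sort descending
(df
    .withColumn("intersect", intersect(a_list)("articles"))
    .withColumn("n_matches", size("intersect"))
    .orderBy(desc("n_matches"))
    .show())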

Based on the names, it looks like you used collect_list, so your data probably looks like this:

df_long = spark.createDataFrame([
    (1, 3),(1, 4), (1, 8), (2, 7), (2, 7), (2, 6)
], ("id", "articles"))

In that case the problem is simpler. Join:

lookup = spark.createDataFrame(a_list, "long").toDF("articles")

joined = lookup.join(df_long, ["articles"])
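
For the toy data above, only articles 4 and 6 survive the join (row order may vary):

joined.show()
# +--------+---+
# |articles| id|
# +--------+---+
# |       4|  1|
# |       6|  2|
# +--------+---+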

and aggregate the result:

joined.groupBy("id").count().show()
# +---+-----+                                                                     
# | id|count|
# +---+-----+
# |  1|    1|
# |  2|    1|
# +---+-----+
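
To pick out the cardnumbers with the most matches, the same count can be sorted in descending order (a small sketch; desc comes from pyspark.sql.functions):

from pyspark.sql.functions import desc

joined.groupBy("id").count().orderBy(desc("count")).show()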


joined.groupBy("id").agg(collect_list("articles")).show()
# +---+----------------------+                                                    
# | id|collect_list(articles)|
# +---+----------------------+
# |  1|                   [4]|
# |  2|                   [6]|
# +---+----------------------+
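
Both aggregates can also be computed in a single pass (a sketch; the matched and n_matches aliases are illustrative, and count and collect_list are standard pyspark.sql.functions):

from pyspark.sql.functions import collect_list, count

# matched articles and their count, per id, in one aggregation
joined.groupBy("id").agg(
    collect_list("articles").alias("matched"),
    count("articles").alias("n_matches")
).show()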
