python spark替代方法可针对非常大的数据爆炸 [英] python spark alternative to explode for very large data

查看：72 发布时间：2020/9/4 8:33:47 python arrays apache-spark count

本文介绍了python spark替代方法可针对非常大的数据爆炸的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个像这样的数据框:

I have a dataframe like this:

df = spark.createDataFrame([(0, ["B","C","D","E"]),(1,["E","A","C"]),(2, ["F","A","E","B"]),(3,["E","G","A"]),(4,["A","C","E","B","D"])], ["id","items"])

会创建一个数据帧df，如下所示:

which creates a data frame df like this:

+---+-----------------+
|  0|     [B, C, D, E]|
|  1|        [E, A, C]|
|  2|     [F, A, E, B]|
|  3|        [E, G, A]|
|  4|  [A, C, E, B, D]|
+---+-----------------+

我想得到这样的结果:

+---+-----+
|all|count|
+---+-----+
|  F|    1|
|  E|    5|
|  B|    3|
|  D|    2|
|  C|    3|
|  A|    4|
|  G|    1|
+---+-----+

从本质上讲，它只是找到df["items"]中所有不同的元素，并对它们的出现频率进行计数.如果我的数据的大小更易于管理，那么我就可以这样做:

Which essentially just finds all distinct elements in df["items"] and counts their frequency. If my data was of a more manageable size, I would just do this:

all_items = df.select(explode("items").alias("all")) 
result = all_items.groupby(all_items.all).count().distinct() 
result.show()

但是因为我的数据在每个列表中都有数百万行和数千个元素，所以这不是一种选择.我当时想逐行执行此操作，因此我一次只能处理2个列表.因为大多数元素经常在许多行中重复(但是每行中的列表是一个集合)，所以这种方法应该可以解决我的问题.但是问题是，我真的不知道如何在Spark中执行此操作，因为我才刚刚开始学习它.有人可以帮忙吗?

But because my data has millions of rows and thousands of elements in each list, this is not an option. I was thinking of doing this row by row, so that I only work with 2 lists at a time. Because most elements are frequently repeated in many rows (but the list in each row is a set), this approach should solve my problem. But the problem is, I don't really know how to do this in Spark, as I've only just started learning it. Could anyone help, please?

python spark替代方法可针对非常大的数据爆炸 [英] python spark alternative to explode for very large data

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

python spark替代方法可针对非常大的数据爆炸 [英] python spark alternative to explode for very large data

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭