How to group by multiple keys in Spark?


Problem description

I have a bunch of tuples in the form of composite keys and values. For example,

tfile.collect() = [(('id1','pd1','t1'),5.0), 
     (('id2','pd2','t2'),6.0),
     (('id1','pd1','t2'),7.5),
     (('id1','pd1','t3'),8.1)  ]

I want to perform SQL-like operations on this collection, where I can aggregate the information based on id[1..n] or pd[1..n]. I want to implement this using the vanilla PySpark APIs rather than SQLContext. In my current implementation I am reading from a bunch of files and merging the RDDs.

def readfile():
    fr = range(6,23)
    tfile = sc.union([sc.textFile(basepath+str(f)+".txt")
                        .map(lambda view: set_feature(view,f)) 
                        .reduceByKey(lambda a, b: a+b)
                        for f in fr])
    return tfile

I intend to create an aggregated array as a value. For example,

agg_tfile = [(('id1','pd1'), [5.0, 7.5, 8.1])]

where 5.0, 7.5, 8.1 represent the values for [t1, t2, t3]. I am currently achieving the same with vanilla Python code using dictionaries. It works fine for smaller data sets, but I worry that this may not scale to larger data sets. Is there an efficient way to achieve the same using the PySpark APIs?
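
For illustration, a minimal sketch of the dictionary-based aggregation described above (the asker's actual code is not shown, so this reconstruction is hypothetical; it assumes the tfile RDD from the first snippet):

# Hypothetical reconstruction of the current dictionary-based approach.
# collect() pulls every tuple to the driver, which is why it may not scale.
agg = {}
for (id_, pd_, t), value in tfile.collect():
    agg.setdefault((id_, pd_), []).append(value)

print(agg)
# {('id1', 'pd1'): [5.0, 7.5, 8.1], ('id2', 'pd2'): [6.0]}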

Recommended answer

My guess is that you want to transpose the data according to multiple fields.

A simple way is to concatenate the target fields you will group by and make that the key of a pair RDD. For example:

lines = sc.parallelize(['id1,pd1,t1,5.0', 'id2,pd2,t2,6.0', 'id1,pd1,t2,7.5', 'id1,pd1,t3,8.1'])
# Join the first two fields into a composite key, then merge the values per key.
rdd = (lines.map(lambda x: x.split(','))
            .map(lambda x: (x[0] + ', ' + x[1], x[3]))
            .reduceByKey(lambda a, b: a + ', ' + b))
print(rdd.collect())

Then you will get the transposed result:

[('id1, pd1', '5.0, 7.5, 8.1'), ('id2, pd2', '6.0')]
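
If the goal is the list-of-floats shape from the question (e.g. [5.0, 7.5, 8.1]) rather than a concatenated string, a small variation of the same idea works. This is a sketch, not part of the original answer; it assumes the same sc and the lines RDD defined above:

# Variation (sketch): keep (id, pd) as a tuple key and the values as floats.
rdd = (lines.map(lambda x: x.split(','))
            .map(lambda x: ((x[0], x[1]), [float(x[3])]))  # key: (id, pd), value: one-element list
            .reduceByKey(lambda a, b: a + b))              # merge the lists per key

print(rdd.collect())
# e.g. [(('id1', 'pd1'), [5.0, 7.5, 8.1]), (('id2', 'pd2'), [6.0])]  (group order may vary)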
