How to efficiently parse a DataFrame into a map of key-value pairs

Problem description

I'm working with a DataFrame that has the columns basketID and itemID. Is there a way to efficiently parse the dataset and generate a map where each key is a basketID and each value is the set of all itemIDs contained in that basket?

My current implementation uses a for loop over the DataFrame, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!

[Screenshot of sample data]

The goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). Here's the implementation I have using a for loop:

// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
  basket("b" + i.toString) = Set();
}
// loop over every row in df and store the items to the set
df.collect().foreach(row => 
  basket(row(0).toString) += row(1).toString
)

Recommended answer

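You can use aggregateByKey on the underlying RDD and then call collectAsMap, which returns the desired result directly as a map on the driver. This is generally more efficient than a plain groupByKey, because aggregateByKey combines the values for each key within a partition before the shuffle, whereas groupByKey sends every (key, value) pair across the network. Note that collectAsMap pulls the whole result onto the driver, so the final map has to fit in driver memory.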

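import spark.implicits._

// Typed view of one row; field names must match the DataFrame columns.
case class Items(basketID: String, itemID: String)

// `df` is the DataFrame from the question (columns basketID, itemID).
// Sets are used as the accumulator so duplicates are dropped even when
// all rows for a key sit in one partition and the merge step never runs.
val result: scala.collection.Map[String, Set[String]] = df.as[Items].rdd
  .map(x => (x.basketID, x.itemID))
  .aggregateByKey(Set.empty[String])(
    (items, item) => items + item, // add one item within a partition
    (a, b) => a ++ b               // merge partial sets across partitions
  )
  .collectAsMap()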
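
To make the two functions passed to aggregateByKey more concrete, here is a minimal plain-Scala sketch (no Spark involved) of the same idea: one function folds the rows of a single partition into per-key sets, the other merges partial results coming from different partitions. The object name, the helper names, and the split of the rows into two partitions are assumptions made for illustration; the rows themselves are reconstructed from the goal map in the question, since the original screenshot is not available.

```scala
// Plain-Scala sketch (no Spark): mimics the two steps of aggregateByKey.
object AggregateByKeySketch {
  type Baskets = Map[String, Set[String]]

  // Per-partition step: add each item to the set for its basket.
  def aggregatePartition(rows: Seq[(String, String)]): Baskets =
    rows.foldLeft(Map.empty[String, Set[String]]) { case (acc, (basket, item)) =>
      acc.updated(basket, acc.getOrElse(basket, Set.empty[String]) + item)
    }

  // Merge step: union the sets of keys that appear in both partial results.
  def mergePartials(a: Baskets, b: Baskets): Baskets =
    b.foldLeft(a) { case (acc, (basket, items)) =>
      acc.updated(basket, acc.getOrElse(basket, Set.empty[String]) ++ items)
    }

  // Hypothetical split of the sample rows into two partitions.
  val partition1 = Seq("b1" -> "i1", "b1" -> "i2", "b2" -> "i2", "b3" -> "i3")
  val partition2 = Seq("b1" -> "i3", "b2" -> "i4", "b3" -> "i5", "b4" -> "i6")

  // Aggregate each partition on its own, then merge the partial maps.
  val baskets: Baskets =
    Seq(partition1, partition2).map(aggregatePartition).reduce(mergePartials)
}
```

Only the partial per-key sets need to move between partitions in this scheme, which is the reason pre-aggregation tends to shuffle less data than grouping all raw pairs first.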

You can also look at the other aggregation APIs, such as reduceByKey and groupBy, and at how aggregateByKey, groupByKey, and reduceByKey differ from one another.
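
If staying in the DataFrame API is preferable to dropping down to RDDs, a different technique from the one in the answer is to group by basketID and aggregate with collect_set. The sketch below assumes a Spark version with the Dataset API (2.x or later) and a DataFrame with string columns named basketID and itemID as in the question; the object and function names are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.collect_set

object BasketsWithCollectSet {
  // Groups in the DataFrame API, then collects the grouped rows to the driver.
  // Assumes the grouped result is small enough to fit in driver memory.
  def toBasketMap(df: DataFrame): Map[String, Set[String]] = {
    val spark = df.sparkSession
    import spark.implicits._ // encoders needed for .as[(String, Seq[String])]

    df.groupBy("basketID")
      .agg(collect_set("itemID").as("items")) // distinct itemIDs per basket
      .as[(String, Seq[String])]              // tuple columns are mapped by position
      .collect()
      .map { case (basket, items) => basket -> items.toSet }
      .toMap
  }
}
```

Calling BasketsWithCollectSet.toBasketMap(df) on the question's DataFrame should yield an immutable Map[String, Set[String]]. The order of elements returned by collect_set is not guaranteed, which is why the values are converted to sets rather than kept as sequences.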
