How to efficiently parse a DataFrame object into a map of key-value pairs
Question
I'm working with a DataFrame with the columns basketID and itemID. Is there a way to efficiently parse through the dataset and generate a map where the keys are basketID values and each value is the set of all itemID values contained in that basket?
My current implementation uses a for loop over the DataFrame, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!
The goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). Here is the implementation I have using a for loop:
// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
  basket("b" + i.toString) = Set()
}
// loop over every row in df and store the items to the set
df.collect().foreach(row =>
  basket(row(0).toString) += row(1).toString
)
Recommended answer
You can use aggregateByKey and then call collectAsMap, which gives you the desired result directly. This is generally more efficient than a plain groupBy, because values are combined within each partition before they are shuffled.
// assumes a SparkSession named spark is in scope, as in spark-shell
import spark.implicits._

case class Items(basketID: String, itemID: String)

// df is the DataFrame from the question, with string columns basketID and itemID
val result = df.as[Items].rdd
  .map(x => (x.basketID, x.itemID))
  .aggregateByKey(Set.empty[String])(
    (items, item) => items + item, // add one item to the partial set within a partition
    (a, b) => a ++ b               // merge partial sets from different partitions
  )
  .collectAsMap() // returns the map to the driver, so it must fit in driver memory
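To make the two functions passed to aggregateByKey more concrete, here is a minimal plain-Scala sketch (no Spark required) that imitates the same two-step aggregation on hard-coded "partitions". The object and method names (AggregateByKeySketch, foldPartition, mergePartials, aggregateLocally) are hypothetical and only for illustration; they are not part of the Spark API.

```scala
object AggregateByKeySketch {
  type Baskets = Map[String, Set[String]]

  // Plays the role of the first function: fold one partition's rows into per-key sets.
  def foldPartition(rows: Seq[(String, String)]): Baskets =
    rows.foldLeft(Map.empty[String, Set[String]]) { case (acc, (basket, item)) =>
      acc.updated(basket, acc.getOrElse(basket, Set.empty[String]) + item)
    }

  // Plays the role of the second function: merge two partial results key by key.
  def mergePartials(a: Baskets, b: Baskets): Baskets =
    b.foldLeft(a) { case (acc, (basket, items)) =>
      acc.updated(basket, acc.getOrElse(basket, Set.empty[String]) ++ items)
    }

  // Hypothetical local stand-in for aggregateByKey followed by collectAsMap.
  def aggregateLocally(partitions: Seq[Seq[(String, String)]]): Baskets =
    partitions.map(foldPartition).foldLeft(Map.empty[String, Set[String]])(mergePartials)

  def main(args: Array[String]): Unit = {
    // Two made-up partitions holding the rows from the question's example.
    val partitions = Seq(
      Seq("b1" -> "i1", "b1" -> "i2", "b2" -> "i2"),
      Seq("b1" -> "i3", "b1" -> "i1", "b2" -> "i4", "b3" -> "i3", "b3" -> "i5", "b4" -> "i6")
    )
    println(aggregateLocally(partitions))
  }
}
```

Because the partial results are sets, duplicates are dropped both within a partition and when partitions are merged, so no separate distinct step is needed.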
You can also look at the other aggregation APIs, such as reduceByKey and groupBy, and at the differences between aggregateByKey, groupByKey, and reduceByKey.
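As one illustration of those alternatives, the same map could be built with reduceByKey by first wrapping each item in a one-element set. This is only a sketch: the object and method names (BasketsByReduce, toBasketMap) are hypothetical, and it assumes the DataFrame has string columns named basketID and itemID as in the question.

```scala
import org.apache.spark.sql.DataFrame

object BasketsByReduce {
  // Assumes df has string columns basketID and itemID, as described in the question.
  def toBasketMap(df: DataFrame): scala.collection.Map[String, Set[String]] =
    df.rdd
      // turn each row into (basketID, single-item set)
      .map(row => (row.getAs[String]("basketID"), Set(row.getAs[String]("itemID"))))
      // union the sets per key; partial unions happen on each partition before the shuffle
      .reduceByKey(_ ++ _)
      // bring the result to the driver, so it must fit in driver memory
      .collectAsMap()
}
```

Compared with aggregateByKey, this version allocates a small set for every row before merging, which is simpler to read but may create more short-lived objects.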