How to efficiently parse a dataframe object into a map of key-value pairs
Problem description
I'm working with a dataframe that has the columns basketID and itemID. Is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is the set of all itemID values contained within each basket?
My current implementation uses a for loop over the dataframe, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!
The goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). Here's the implementation I have using a for loop:
// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
  basket("b" + i.toString) = Set()
}
// loop over every row in df and store the items to the set
df.collect().foreach(row =>
  basket(row(0).toString) += row(1).toString
)
You can simply do an aggregateByKey operation, and then collectAsMap will directly give you the desired result. It is much more efficient than a plain groupByKey.
import scala.collection.mutable

case class Items(basketID: String, itemID: String)

import spark.implicits._

val result = output.as[Items].rdd
  .map(x => (x.basketID, x.itemID))
  // accumulate itemIDs per basket: the first function appends a value
  // within a partition, the second merges partition buffers and de-duplicates
  .aggregateByKey[mutable.Buffer[String]](new mutable.ArrayBuffer[String]())(
    (l: mutable.Buffer[String], p: String) => l += p,
    (l1: mutable.Buffer[String], l2: mutable.Buffer[String]) => (l1 ++ l2).distinct)
  .collectAsMap()
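Note that collectAsMap yields values of type mutable.Buffer[String]. If the exact target shape of Map[String, Set[String]] is wanted, a final conversion on the driver is cheap. A minimal sketch, using a hypothetical sample map standing in for the collected result:

```scala
import scala.collection.mutable

// hypothetical sample standing in for the collected result
val collected = Map(
  "b1" -> mutable.ArrayBuffer("i1", "i2", "i3"),
  "b2" -> mutable.ArrayBuffer("i2", "i4")
)

// convert each buffer to an immutable Set to match the target shape
val basket: Map[String, Set[String]] =
  collected.map { case (k, vs) => k -> vs.toSet }
```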
You can check other aggregation APIs like reduceByKey and groupByKey. Please also check the differences between aggregateByKey, groupByKey, and reduceByKey.
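To see why aggregateByKey is cheaper than groupByKey, here is a plain-Scala sketch (hypothetical sample data, no Spark needed) of its two-phase semantics: a seqOp folds values into a per-key accumulator inside each partition, and a combOp merges the per-partition accumulators, so only the compact accumulators cross the shuffle.

```scala
// two simulated partitions of (basketID, itemID) pairs -- hypothetical data
val partitions = Seq(
  Seq(("b1", "i1"), ("b2", "i2")),
  Seq(("b1", "i2"), ("b1", "i1"))
)

// seqOp: fold one value into a per-key accumulator (runs inside a partition)
def seqOp(acc: Set[String], v: String): Set[String] = acc + v
// combOp: merge two accumulators for the same key (runs across partitions)
def combOp(a: Set[String], b: Set[String]): Set[String] = a ++ b

// phase 1: map-side aggregation within each partition
val perPartition = partitions.map(_.foldLeft(Map.empty[String, Set[String]]) {
  case (m, (k, v)) => m.updated(k, seqOp(m.getOrElse(k, Set.empty), v))
})

// phase 2: merge the small per-partition maps
val merged = perPartition.reduce { (m1, m2) =>
  m2.foldLeft(m1) { case (m, (k, s)) =>
    m.updated(k, combOp(m.getOrElse(k, Set.empty), s))
  }
}
// merged == Map("b1" -> Set("i1", "i2"), "b2" -> Set("i2"))
```

groupByKey, by contrast, shuffles every raw (key, value) pair across the network before any grouping happens, which is why it is the more expensive choice here.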