How to efficiently parse a dataframe object into a map of key-value pairs


Problem description

I'm working with a dataframe that has the columns basketID and itemID. Is there a way to efficiently parse through the dataset and generate a map where the keys are the basketIDs and each value is the set of all itemIDs contained in that basket?

My current implementation uses a for loop over the dataframe, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!

Screenshot of sample data:

The goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). Here's the implementation I have using a for loop:

// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
  basket("b" + i.toString) = Set();
}
// loop over every row in df and store the items to the set
df.collect().foreach(row => 
  basket(row(0).toString) += row(1).toString
)

Solution

You can simply do an aggregateByKey operation and then collectAsMap will directly give you the desired result. It is much more efficient than a simple groupBy.

import scala.collection.mutable

case class Items(basketID: String, itemID: String)

import spark.implicits._

// Map each row to a (basketID, itemID) pair, aggregate the items per basket,
// and collect the (small) aggregated result to the driver as a Map.
val result = output.as[Items].rdd
  .map(x => (x.basketID, x.itemID))
  .aggregateByKey[mutable.Buffer[String]](new mutable.ArrayBuffer[String]())(
    // seqOp: append an item to the per-partition buffer
    (l: mutable.Buffer[String], p: String) => l += p,
    // combOp: merge buffers from different partitions, dropping duplicates
    (l1: mutable.Buffer[String], l2: mutable.Buffer[String]) => (l1 ++ l2).distinct)
  .collectAsMap()
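The collected values are mutable buffers rather than sets. As a small follow-up sketch (reusing the result value from above), they can be converted on the driver into the immutable Set values the question asks for:

// result is a collection.Map[String, mutable.Buffer[String]]; turn each
// buffer into an immutable Set to match the desired output shape.
val basket: Map[String, Set[String]] =
  result.map { case (id, items) => id -> items.toSet }.toMap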

You can check other aggregation APIs like reduceByKey and groupByKey over here. Please also check the differences between aggregateByKey, groupByKey, and reduceByKey.
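For comparison, here is a minimal DataFrame-only sketch using collect_set (assuming the dataframe is named df with the columns basketID and itemID, as in the question); it avoids dropping to the RDD API entirely:

import org.apache.spark.sql.functions.collect_set

// Group by basket and collect the distinct items for each one.
val grouped = df.groupBy("basketID").agg(collect_set("itemID").as("items"))

// Bring the aggregated rows back to the driver and build the map.
val basketFromDf: Map[String, Set[String]] =
  grouped.collect()
    .map(row => row.getString(0) -> row.getSeq[String](1).toSet)
    .toMap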
