Spark dataframe to nested map
Question
How can I convert a rather small DataFrame in Spark (max 300 MB) to a nested map in order to simplify Spark's DAG? I believe this will be quicker than a join later on (a dynamically built Spark DAG is a lot slower than a hard-coded one), since the transformed values were created during the fit step of a custom estimator. Now I just want to apply them quickly during the predict step of the pipeline.
import spark.implicits._  // assumes a SparkSession named `spark` is in scope; needed for toDF

val inputSmall = Seq(
  ("A", 0.3, "B", 0.25),
  ("A", 0.3, "g", 0.4),
  ("d", 0.0, "f", 0.1),
  ("d", 0.0, "d", 0.7),
  ("A", 0.3, "d", 0.7),
  ("d", 0.0, "g", 0.4),
  ("c", 0.2, "B", 0.25)).toDF("column1", "transformedCol1", "column2", "transformedCol2")
This gives the wrong type of map (an Array[Map[String, Any]] with one map per row, keyed by column name, rather than one map per column):
val inputToMap = inputSmall.collect.map(r => Map(inputSmall.columns.zip(r.toSeq):_*))
I would rather have something like this:

Map[String, Map[String, Double]]("column1" -> Map("A" -> 0.3, "d" -> 0.0, ...), "column2" -> Map("B" -> 0.25, "g" -> 0.4, ...))
Answer
Edit: removed the collect operation from the final map.
If you are using Spark 2+, here's a suggestion:
import org.apache.spark.sql.functions.map

// Pair each category column with its transformed value as a one-entry map column
val inputToMap = inputSmall.select(
  map($"column1", $"transformedCol1").as("column1"),
  map($"column2", $"transformedCol2").as("column2")
)

// Collect once, then merge the per-row single-entry maps into one map per column
val cols = inputToMap.columns
val localData = inputToMap.collect

cols.map { colName =>
  colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap
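Once built, the nested map replaces the join during predict with a constant-time lookup per value. A minimal sketch in plain Scala (outside Spark; the map literal and the names `nestedMap`, `transform`, and the default for unseen categories are illustrative assumptions, standing in for the collected result above):

```scala
object NestedMapLookup {
  // Stand-in for the Map[String, Map[String, Double]] collected above
  val nestedMap: Map[String, Map[String, Double]] = Map(
    "column1" -> Map("A" -> 0.3, "d" -> 0.0, "c" -> 0.2),
    "column2" -> Map("B" -> 0.25, "g" -> 0.4, "f" -> 0.1, "d" -> 0.7)
  )

  // O(1) lookup instead of a join; falls back to a default for unseen categories
  def transform(col: String, value: String, default: Double = 0.0): Double =
    nestedMap.getOrElse(col, Map.empty).getOrElse(value, default)

  def main(args: Array[String]): Unit = {
    println(transform("column1", "A")) // 0.3
    println(transform("column2", "x")) // 0.0 (unseen category)
  }
}
```

During predict this can be wrapped in a UDF (after broadcasting the map) so each row is transformed without shuffling the small lookup table.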