How to (equally) partition array-data in spark dataframe
Question
I have a dataframe of the following form:
import scala.util.Random
val localData = (1 to 100).map(i => (i,Seq.fill(Math.abs(Random.nextGaussian()*100).toInt)(Random.nextDouble)))
val df = sc.parallelize(localData).toDF("id","data")
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: double (containsNull = false)
df.withColumn("data_size",size($"data")).show
+---+--------------------+---------+
| id| data|data_size|
+---+--------------------+---------+
| 1|[0.77845301260182...| 217|
| 2|[0.28806915178410...| 202|
| 3|[0.76304121847720...| 165|
| 4|[0.57955190088558...| 9|
| 5|[0.82134215959459...| 11|
| 6|[0.42193739241567...| 57|
| 7|[0.76381645621403...| 4|
| 8|[0.56507523859466...| 93|
| 9|[0.83541853717244...| 107|
| 10|[0.77955626749231...| 111|
| 11|[0.83721643562080...| 223|
| 12|[0.30546029947285...| 116|
| 13|[0.02705462199952...| 46|
| 14|[0.46646815407673...| 41|
| 15|[0.66312488908446...| 16|
| 16|[0.72644646115640...| 166|
| 17|[0.32210572380128...| 197|
| 18|[0.66680355567329...| 61|
| 19|[0.87055594653295...| 55|
| 20|[0.96600507545438...| 89|
+---+--------------------+---------+
Now I want to apply an expensive UDF whose computation time is roughly proportional to the size of the data array. I wonder how I can repartition my data such that each partition holds approximately the same amount of "records * data_size" (i.e., data points, NOT just records).
If I just do df.repartition(100), I may get some partitions containing very large arrays, which then become the bottleneck of the entire Spark stage (all other tasks having already finished). Of course I could just choose an insanely large number of partitions, which would (almost) ensure that each record ends up in a separate partition. But is there another way?
Answer
As you said, you can increase the number of partitions. I usually use a multiple of the number of cores: spark context default parallelism * 2-3. In your case, you could use a bigger multiplier.
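A minimal sketch of that rule of thumb (the multiplier of 3 and the helper name are illustrative assumptions, not Spark API):

```scala
// Sketch: derive a partition count from the cluster's default parallelism.
// The multiplier is a tuning knob; 2-3 is the usual starting point.
def partitionCount(defaultParallelism: Int, multiplier: Int = 3): Int =
  defaultParallelism * multiplier

// With a SparkContext in scope this would be applied as:
//   val repartitioned = df.repartition(partitionCount(sc.defaultParallelism))
```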
Another solution would be to split your df in this way:
- a df with only the larger arrays
- a df with the rest
You could then repartition each of them, perform the computation, and union them back.
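A rough sketch of the split/repartition/union idea. The size threshold of 100 and the partition counts below are assumptions to tune for your data:

```scala
// Pure helper deciding which bucket a record falls into, based on the
// size of its data array (the threshold is an illustrative assumption).
def isLarge(dataSize: Int, threshold: Int = 100): Boolean = dataSize > threshold

// With the DataFrame API this would drive two filters that are
// repartitioned separately and unioned back:
//   val big      = df.filter(size($"data") >  100).repartition(200)
//   val small    = df.filter(size($"data") <= 100).repartition(50)
//   val balanced = big.union(small)
```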
Beware that repartitioning may be expensive, since you have large rows to shuffle around.
You could have a look at these slides (27+): https://www.slideshare.net/SparkSummit/custom-applications-with-sparks-rdd-spark-summit-east-talk-by-tejas-patil
They were experiencing very bad data skew and had to handle it in an interesting way.
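The load-balancing idea behind a custom partitioner can be sketched in plain Scala: greedily assign each record to the currently least-loaded partition, weighting by its array size. This helper is an illustration of the technique, not a Spark API:

```scala
import scala.collection.mutable

// Greedy load balancing: place each (id, dataSize) record on the partition
// with the smallest accumulated size so far. Returns id -> partition index.
def balanceBySize(sizes: Seq[(Int, Int)], numPartitions: Int): Map[Int, Int] = {
  val load = mutable.ArrayBuffer.fill(numPartitions)(0L)
  // Placing the largest arrays first gives a tighter packing.
  sizes.sortBy(-_._2).map { case (id, s) =>
    val p = load.zipWithIndex.minBy(_._1)._2
    load(p) += s
    id -> p
  }.toMap
}
```

The resulting id -> partition map could then be joined back as a key column and used with repartitioning on that key; the heaviest records end up spread across partitions instead of clustering in one.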