How to use COGROUP for large datasets

Question

I have two RDDs, namely val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)]. I'm using cogroup for those datasets like:

val tab_c = tab_a.cogroup(tab_b).collect.toArray

val updated = tab_c.map { x =>
  // some code
}

I'm using the cogrouped values from tab_c in a map function. It works fine for small datasets, but with huge datasets it throws an Out Of Memory exception.

I have tried converting the final value back to an RDD, but no luck; I get the same error.

val newcos = spark.sparkContext.parallelize(tab_c)

1. How to use cogroup for large datasets?

2. Can we persist the cogrouped values?

Code

import org.apache.spark.storage.StorageLevel

// Key both datasets by the first CSV column
val source_primary_key = source.map(rec => (rec.split(",")(0), rec))
source_primary_key.persist(StorageLevel.DISK_ONLY)

val destination_primary_key = destination.map(rec => (rec.split(",")(0), rec))
destination_primary_key.persist(StorageLevel.DISK_ONLY)

// Cogroup, then pull everything back to the driver
val cos = source_primary_key.cogroup(destination_primary_key).repartition(10).collect()

var srcmis: Array[String] = new Array[String](0)
var destmis: Array[String] = new Array[String](0)

var extrainsrc: Array[String] = new Array[String](0)
var extraindest: Array[String] = new Array[String](0)

var srcs: String = ""
var destt: String = ""

val updated = cos.map { x =>
  val key = x._1
  val value = x._2

  srcs = value._1.mkString(",")
  destt = value._2.mkString(",")

  // Present in both but with different values
  if (!srcs.equalsIgnoreCase(destt) && destt != "") {
    srcmis :+= srcs
    destmis :+= destt
  }

  // Present only in destination
  if (srcs == "") {
    extraindest :+= destt
  }

  // Present only in source
  if (destt == "") {
    extrainsrc :+= srcs
  }
}

Code update:

val tab_c = tab_a.cogroup(tab_b).filter(x => x._2._1 != x._2._2)
// tab_c = {1, CompactBuffer(1,john,US), CompactBuffer(1,john,UK)}
//         {2, CompactBuffer(2,john,US), CompactBuffer(2,johnson,UK)} ...

Error:

 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(4,3,ResultTask,FetchFailed(null,0,-1,27,org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)


ERROR YarnScheduler: Lost executor 8 on datanode1: Container killed by YARN for exceeding memory limits. 1.0 GB of 1020 MB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Thanks

Answer

When you use collect() you are basically telling Spark to move all the resulting data back to the master node, which can easily produce a bottleneck. At that point you are no longer using Spark; you are just working with a plain array on a single machine.

To trigger computation, just use something that requires the data on every node; that's why executors live on top of a distributed file system. For instance, saveAsTextFile().

Here are some basic examples.
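
For instance, a minimal sketch of the same comparison done without collect(), assuming the tab_a and tab_b from the question: the cogrouped RDD is filtered on the executors and the result is written straight to a distributed file system. The output path and the record format here are made up for illustration.

val cogrouped = tab_a.cogroup(tab_b)

// Keep only keys where source and destination differ; nothing is pulled to the driver
val mismatches = cogrouped.flatMap { case (key, (srcVals, destVals)) =>
  val srcs  = srcVals.mkString(",")
  val destt = destVals.mkString(",")
  if (srcs.nonEmpty && destt.nonEmpty && !srcs.equalsIgnoreCase(destt))
    Some(s"$key|$srcs|$destt")
  else
    None
}

// Write the result from the executors instead of collecting it
// (the HDFS path is just an example)
mismatches.saveAsTextFile("hdfs:///tmp/cogroup_mismatches")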

请记住,这里的整个目标(即,如果您有大数据)是将代码移动到您的数据并在那里进行计算,而不是将所有数据都用于计算.

Remember, the entire objective here (that is, if you have big data) is to move the code to your data and compute there, not to bring all the data to the computation.
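
On the second question: the cogrouped result can be persisted like any other RDD, as long as it stays an RDD and is not collected first. A minimal sketch, again assuming the tab_a and tab_b from the question (the storage level is just one possible choice):

import org.apache.spark.storage.StorageLevel

val cogrouped = tab_a.cogroup(tab_b)
cogrouped.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when it does not fit in memory

// Reuse the persisted RDD in several actions without recomputing the cogroup
val totalKeys     = cogrouped.count()
val missingInDest = cogrouped.filter { case (_, (src, dest)) => dest.isEmpty }.count()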
