Spark 2.x with mapPartitions: parallel processing of a large number of records

Views: 601 | Published: 2020/10/17 2:42:40 | Tags: scala, dataframe, apache-spark, parallel-processing


Problem description

I am trying to use Spark's mapPartitions with Datasets (Spark 2.x) to copy a large list of files (1 million records) from one location to another in parallel. However, at times I see that one record gets copied multiple times.

The idea is to split the 1 million files into a number of partitions (here, 24), perform the copy operation for each partition in parallel, and finally collect the result from each partition to perform further actions.

Can someone please tell me what I am doing wrong?

def process(spark: SparkSession): DataFrame = {
  import spark.implicits._
  //Get source and target List for 1 million records
  val sourceAndTargetList =
    List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))

  // convert list to dataframe with number of partitions as 24
  val SourceTargetDataSet =
    sourceAndTargetList.toDF.repartition(24).as[(String, String)]
  var dfBuffer = new ListBuffer[DataFrame]()
  dfBuffer += SourceTargetDataSet
    .mapPartitions(partition => {
      println("partition id: " + TaskContext.getPartitionId)
      //for each partition
      val result = partition
        .map(row => {
          val source = row._1
          val target = row._2
          val copyStatus = copyFiles(source, target) // Function to copy files that returns a boolean
          val dataframeRow = (target, copyStatus)
          dataframeRow
        })
        .toList

      result.toIterator
    })
    .toDF()

  val dfList = dfBuffer.toList
  val newDF = dfList.tail.foldLeft(dfList.head)(
    (accDF, newDF) => accDF.join(newDF, Seq("_1"))
  )

  println("newDF Count " + newDF.count)
  newDF
}
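
One possible reading of the duplicate copies, which the source does not confirm: `mapPartitions` is a lazy transformation, so the copy only runs when an action is called, and it runs again for every further action on the same uncached DataFrame. In the function above, `dfBuffer` holds a single DataFrame, so the `foldLeft` returns it unchanged; `newDF.count` then triggers the copy once, and any later action the caller runs on the returned `newDF` could trigger it again. The plain-Scala sketch below is only an analogy for that behaviour, using a lazy collection view instead of Spark; `copyFiles` here is a hypothetical stand-in that just logs each call.

```scala
import scala.collection.mutable.ListBuffer

object LazyCopyDemo {
  // Records every invocation so that repeated copies become visible.
  val copyLog: ListBuffer[String] = ListBuffer.empty[String]

  // Hypothetical stand-in for the question's copyFiles: it only logs the call.
  def copyFiles(source: String, target: String): Boolean = {
    copyLog += s"$source->$target"
    true
  }

  // Traverses a lazy pipeline twice; the side effect runs on every traversal.
  def copiesWithLazyPipeline(pairs: List[(String, String)]): Int = {
    copyLog.clear()
    val lazyResults = pairs.view.map { case (s, t) => (t, copyFiles(s, t)) }
    val first = lazyResults.toList  // first "action": every pair is copied
    val second = lazyResults.toList // second "action": every pair is copied again
    copyLog.size
  }

  // Materializes the results once and reuses them; the side effect runs once per pair.
  def copiesWithMaterializedResults(pairs: List[(String, String)]): Int = {
    copyLog.clear()
    val results = pairs.map { case (s, t) => (t, copyFiles(s, t)) } // strict map
    val first = results.toList
    val second = results.toList
    copyLog.size
  }

  def main(args: Array[String]): Unit = {
    val pairs = List("source1" -> "target1", "source2" -> "target2")
    println(s"lazy pipeline, two traversals: ${copiesWithLazyPipeline(pairs)} copies")
    println(s"materialized once, two reads:  ${copiesWithMaterializedResults(pairs)} copies")
  }
}
```

With two input pairs, the lazy variant logs four copies and the materialized variant logs two. Whether this is what happened in the original job depends on how the caller used the returned DataFrame, which the question does not show.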

Update 2: I changed the function as shown below, and so far it has given me consistent results as expected. May I know what I was doing wrong, and whether I am getting the required parallelization with the function below? If not, how can this be optimized?

def process(spark: SparkSession): DataFrame = {
  import spark.implicits._
  //Get source and target List for 1 million records
  val sourceAndTargetList =
    List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))

  // convert list to dataframe with number of partitions as 24
  val SourceTargetDataSet =
    sourceAndTargetList.toDF.repartition(24).as[(String, String)]
  val iterator = SourceTargetDataSet.toDF
    .mapPartitions(
      (it: Iterator[Row]) =>
        it.toList
          .map(row => {
            println(row)

            val source = row.toString.split(",")(0).drop(1)
            val target = row.toString.split(",")(1).dropRight(1)
            println("source : " + source)
            println("target: " + target)
            val copyStatus = copyFiles(source, target) // Function to copy files that returns a boolean
            val dataframeRow = (target, copyStatus)
            dataframeRow
          })
          .iterator
    )
    .toLocalIterator

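  // Dataset.toLocalIterator returns a java.util.Iterator, so convert it first
  // (requires: import scala.collection.JavaConverters._)
  val df = iterator.asScala.toList.toDF("targetKey", "copyStatus")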
  df
}
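
On the parallelization question, a hedged observation rather than a confirmed answer: the version above returns a DataFrame built from local data, so later actions on it do not re-run the copy, which would explain the consistent results. However, `toLocalIterator` fetches partitions one at a time as the iterator is consumed, so the 24 partitions may be processed largely sequentially. A minimal sketch of an alternative, assuming the (target, status) pairs for 1 million records fit comfortably on the driver, is to trigger a single `collect()`, which processes all partitions in one parallel job. The `copyFiles` body and the `CopyOnceSketch` name are hypothetical placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CopyOnceSketch {
  // Hypothetical stand-in for the question's copyFiles(source, target);
  // the real function would perform the copy and report success.
  def copyFiles(source: String, target: String): Boolean = true

  def process(spark: SparkSession,
              sourceAndTargetList: Seq[(String, String)]): DataFrame = {
    import spark.implicits._

    // One action (collect) runs the copy for all 24 partitions in a single
    // parallel job and returns the (target, status) pairs to the driver.
    val results: Array[(String, Boolean)] = sourceAndTargetList
      .toDS()
      .repartition(24)
      .mapPartitions((it: Iterator[(String, String)]) =>
        it.map { case (source, target) => (target, copyFiles(source, target)) })
      .collect()

    // The returned DataFrame is backed by local data, so later actions on it
    // do not re-run the mapPartitions stage.
    results.toSeq.toDF("targetKey", "copyStatus")
  }
}
```

A caller could use it as `CopyOnceSketch.process(spark, pairs).show()`. Even with this shape, Spark may still execute an individual task more than once (for example on task retry or with speculative execution enabled), so it is safer if `copyFiles` is idempotent.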
