What is the correct way of using a memSQL Connection object inside the call method of Apache Spark code


Problem description

I have Spark code where the code inside the Call method queries a memSQL database to read from a table. My code opens a new connection object each time and closes it after the task is done. This works fine, but the execution time for the Spark job becomes high. What would be a better way to do this so that the Spark job's execution time is reduced?

Thanks.

Recommended answer

You can use one connection per partition, like this:

rdd.foreachPartition { partition =>
  // Materialize the iterator once: it can only be traversed a single time,
  // so passing it to DB.save after the foreach would yield no records.
  val records = partition.toList
  val connection = DB.createConnection()
  try {
    // Reuse the same connection instance for every record in this partition.
    records.foreach { r =>
      val externalData = connection.read(r.externalId)
      // do something with your data
    }
    DB.save(records)
  } finally {
    connection.close()
  }
}

If you use Spark Streaming:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Materialize the iterator once: it can only be traversed a single time,
    // so passing it to DB.save after the foreach would yield no records.
    val records = partition.toList
    val connection = DB.createConnection()
    try {
      // Reuse the same connection instance for every record in this partition.
      records.foreach { r =>
        val externalData = connection.read(r.externalId)
        // do something with your data
      }
      DB.save(records)
    } finally {
      connection.close()
    }
  }
}

See http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
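The linked guide goes one step further: instead of opening a fresh connection per partition, it suggests keeping a static, lazily-created pool so connections are reused across tasks and streaming batches on the same executor. A minimal sketch of that idea, assuming a hypothetical `ConnectionPool` class (it is not part of Spark or the memSQL driver; a real job would plug in its own JDBC connection factory):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical per-JVM connection pool, generic over the connection type so
// the real factory (e.g. a memSQL JDBC driver call) can be supplied from outside.
class ConnectionPool[C](create: () => C, isValid: C => Boolean) {
  private val idle = new ConcurrentLinkedQueue[C]()

  // Reuse an idle connection when a valid one exists, otherwise create one.
  def borrow(): C = {
    val c = idle.poll()
    if (c != null && isValid(c)) c else create()
  }

  // Return the connection so the next task on this executor can reuse it.
  def giveBack(c: C): Unit = idle.offer(c)
}

object ConnectionPoolDemo {
  def main(args: Array[String]): Unit = {
    var created = 0
    val pool = new ConnectionPool[String](
      create = () => { created += 1; s"conn-$created" },
      isValid = _ => true
    )
    val c1 = pool.borrow()   // no idle connection yet: creates conn-1
    pool.giveBack(c1)
    val c2 = pool.borrow()   // reuses conn-1 instead of creating conn-2
    println(s"created=$created reused=${c1 eq c2}")  // prints: created=1 reused=true
  }
}
```

Inside `foreachPartition`, the task would call `pool.borrow()` in place of `DB.createConnection()` and `pool.giveBack(connection)` in place of `connection.close()`. If the pool lives in a singleton `object`, it is instantiated once per executor JVM, so later tasks and streaming batches pick up connections that are already open.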
