Connection pooling in a streaming PySpark application

Question

What is the proper way to use connection pools in a streaming PySpark application?

I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-session-object-created-in-fo.html and understand that the proper way in Scala/Java is to use a singleton. Is this possible in Python? A small code example would be greatly appreciated. I believe creating a connection per partition would be very inefficient for a streaming application.

Recommended answer

Long story short, connection pools will be less useful in Python than on the JVM because of the PySpark architecture. Unlike their Scala counterparts, Python executors use separate worker processes. This means there is no shared state between those processes, and since each partition is processed sequentially by default, you can have only one active connection per interpreter.

Of course, it can still be useful to maintain connections between batches. To achieve that you'll need two things:

  • spark.python.worker.reuse has to be set to true.
  • A way to reference an object between different calls.
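
As a rough illustration of the first requirement, the setting can be passed on the command line through spark-submit's --conf flag. The sketch below only assembles that command. The helper name build_submit_command and the script name streaming_app.py are placeholders, not part of the original answer. The same key can also be set with SparkConf().set(...) in the driver. Spark's configuration documentation lists true as the default, so this mostly makes the requirement explicit.

```python
def build_submit_command(app_path, extra_conf=None):
    """Build a spark-submit argument list that keeps Python workers alive.

    build_submit_command is a hypothetical helper used for illustration only.
    """
    conf = {"spark.python.worker.reuse": "true"}
    conf.update(extra_conf or {})

    cmd = ["spark-submit"]
    # Each setting becomes one "--conf key=value" pair.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


# "streaming_app.py" is a placeholder for the actual application script.
print(" ".join(build_submit_command("streaming_app.py")))
```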

The first one is pretty obvious, and the second one is not really Spark specific. You can, for example, use a module singleton (you'll find a Spark example in my answer to "How to run a function on all Spark workers before processing data in PySpark?") or the Borg pattern.
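
A minimal sketch of the module singleton idea, not the code from the linked answer. It assumes the code lives in an importable module (here called connection_holder.py, a made-up name) that is shipped to the workers, and it uses sqlite3 purely as a stand-in for a real database client. The names get_connection and save_partition and the events table are illustrative as well.

```python
# connection_holder.py (hypothetical module name), made available to the
# workers, for example with --py-files.
import os
import sqlite3

# Module-level state: one value per Python worker process. It survives
# between batches only while the worker process itself is reused.
_connection = None


def get_connection():
    """Return this process's connection, creating it on first use."""
    global _connection
    if _connection is None:
        # sqlite3 is a stand-in; swap in the client for your database.
        _connection = sqlite3.connect(":memory:")
        _connection.execute(
            "CREATE TABLE IF NOT EXISTS events (pid INTEGER, value TEXT)"
        )
    return _connection


def save_partition(rows):
    """Write every row of one partition through the cached connection."""
    conn = get_connection()
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(os.getpid(), str(row)) for row in rows],
    )
    conn.commit()


# In the streaming job (not executed here), assuming `stream` is a DStream:
#   from connection_holder import save_partition
#   stream.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))
```

Because the connection is created lazily inside the worker, nothing connection-related has to be pickled and sent from the driver. Each worker process opens its connection on the first partition it handles and reuses it afterwards.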
