Parallelization via JDBC - Pyspark - How does parallelization work using JDBC?


Problem description

How does parallelization work using JDBC?

Here is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Z, Y and K stand for the actual lowerBound, upperBound and numPartitions values.
DF    = spark.read.jdbc( url           =  ...,
                         table         = '...',
                         column        = 'XXXX',
                         lowerBound    =  Z,
                         upperBound    =  Y,
                         numPartitions = K
                         )

I would like to know how the following parameters are related and whether there is a way to choose them properly:

  1. column -> it should be the column chosen for the partitioning
    (does it need to be a numeric column?)
  2. lowerBound -> is there a rule of thumb for choosing it?
  3. upperBound -> is there a rule of thumb for choosing it?
  4. numPartitions -> is there a rule of thumb for choosing it?

I understand that

stride = ( upperBound / numPartitions ) - ( lowerBound / numPartitions )

Are there many "strides" in each partition?

In other words, are the partitions filled with a bunch of strides until all the observations have been read?

Please look at this picture to get a sense of the question, considering the following parameters:

 lowerBound        80,000
 upperBound       180,000
 numPartitions          8
 Stride            12,500

Please note:

 min('XXXX')      =           0
 max('XXXX')      =     350,000
 ('XXXX').count() = 500,000,000
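
To make the stride/partition relationship concrete, below is a small sketch of the per-partition WHERE predicates Spark effectively builds from column, lowerBound, upperBound and numPartitions. The helper name partition_predicates is my own, and this is only an approximation of the behaviour described in the Spark documentation, not Spark's actual code; it uses the numbers from the example above.

# Sketch only: approximates the per-partition WHERE clauses Spark derives
# from (column, lowerBound, upperBound, numPartitions). Not Spark source code.
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    stride = (upper_bound - lower_bound) // num_partitions   # 12,500 here
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end   = start + stride
        if i == 0:
            # rows below lowerBound (and NULLs) all fall into the first partition
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # rows at or above the last boundary all fall into the last partition
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates

for p in partition_predicates('XXXX', 80_000, 180_000, 8):
    print(p)

Each partition therefore covers exactly one stride of 12,500 values, except the first and last ones, which are open-ended. Since min('XXXX') = 0 and max('XXXX') = 350,000, every row outside the 80,000-180,000 range is still read, but it lands in the first or last partition, so those two partitions will be heavily skewed: lowerBound and upperBound only shape the stride, they do not filter rows.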

P.S. I read the documentation and this answer, but I didn't understand it very well.

Recommended answer

  1. Yes, the column needs to be a numeric column, according to the documentation. Why? Because otherwise you can't calculate the stride, which is (upperBound - lowerBound) / numPartitions = 12,500 (items per partition).

I think it would be ideal if column is already an indexed column in your database, since you will need to retrieve these records as fast as possible. Then upperBound and lowerBound should be the boundaries of the data you want to retrieve into Spark (e.g. consider column = id; then the data you need might be id between 1 and max(id)).
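
For instance, here is a hedged sketch of deriving the bounds from the table itself before the partitioned read; jdbc_url, my_table and id are placeholder names, and the subquery-as-table form is standard JDBC usage, though you should check your database's dialect:

# Sketch: derive lowerBound/upperBound from the source table first.
# jdbc_url, my_table and id are placeholders for your own values.
bounds = spark.read.jdbc(
    url=jdbc_url,
    table="(SELECT MIN(id) AS lo, MAX(id) AS hi FROM my_table) AS b",
).collect()[0]

df = spark.read.jdbc(
    url=jdbc_url,
    table="my_table",
    column="id",
    lowerBound=bounds["lo"],
    upperBound=bounds["hi"],
    numPartitions=8,
)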

The right numPartitions is a difficult topic to pin down for all cases. One classic issue to be aware of, though, is the size of your connection pool. You should, for instance, avoid creating more connections in parallel than your pool can handle. Of course, the number of parallel connections is directly tied to the number of partitions. For example, if you have at most 8 partitions, make sure that the maximum number of parallel connections is also 8. For more about how to choose a right value for numPartitions you can check this.
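
As a minimal illustration of that constraint (max_db_connections is a placeholder for whatever limit your pool or DBA gives you):

# Sketch: cap the number of partitions at the number of connections
# the database can actually serve in parallel.
max_db_connections = 8      # placeholder: your pool / server limit
desired_partitions = 32
num_partitions     = min(desired_partitions, max_db_connections)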

Good luck.

