JDBC to Spark Dataframe - How to ensure even partitioning?
Problem Description
I am new to Spark, and am working on creating a DataFrame from a Postgres database table via JDBC, using spark.read.jdbc.
I am a bit confused about the partitioning options, in particular partitionColumn, lowerBound, upperBound, and numPartitions.
- The documentation seems to indicate that these fields are optional. What happens if I don't provide them?
- How does Spark know how to partition the queries? How efficient will that be?
- If I DO specify these options, how do I ensure that the partition sizes are roughly even if the partitionColumn is not evenly distributed?
Let's say I'm going to have 20 executors, so I set my numPartitions to 20.
My partitionColumn is an auto-incremented ID field, and let's say the values range from 1 to 2,000,000.
However, because the user selects to process some really old data, along with some really new data, with nothing in the middle, most of the data has ID values either under 100,000 or over 1,900,000.
Will my 1st and 20th executors get most of the work, while the other 18 executors sit there mostly idle?
If so, is there a way to prevent this?
Recommended Answer
What are all these options? spark.read.jdbc reads a table from an RDBMS.
Parallelism is the power of Spark; to achieve it on a JDBC read you have to supply all of these options.
Questions :-)
1) The documentation seems to indicate that these fields are optional. What happens if I don't provide them?
Answer: default parallelism, or poor parallelism.
Depending on the scenario, the developer has to take care of the performance-tuning strategy and ensure that the data splits across boundaries (a.k.a. partitions), which in turn become parallel tasks.
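For illustration, a minimal sketch (assuming jdbcUrl and connectionProperties are already defined, and the employees table from the example further below): with none of the four options set, Spark opens a single JDBC connection and reads the whole table into one partition.

// Minimal sketch: no partitioning options are given, so Spark uses a
// single JDBC connection and the DataFrame ends up with one partition.
val unpartitioned = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)
println(unpartitioned.rdd.getNumPartitions) // 1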
2) How does Spark know how to partition the queries? How efficient will that be?
You can provide split boundaries based on the dataset’s column values.
- These options specify the parallelism of the read.
- If any one of these options is specified, then all of them must be specified.
Note
These options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but do not filter the rows in the table. Spark therefore partitions and returns all rows in the table.
Example 1:
You can split the table read across executors on the emp_no column using the partitionColumn, lowerBound, upperBound, and numPartitions parameters.
// Read employees in 100 range-partitioned slices on emp_no,
// i.e. up to 100 parallel JDBC connections.
val df = spark.read.jdbc(url = jdbcUrl,
  table = "employees",
  columnName = "emp_no",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 100,
  connectionProperties = connectionProperties)
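Under the hood, each of the 100 partitions issues its own range query; the stride here is roughly (100000 - 1) / 100 = 1000, so the generated WHERE clauses look approximately like emp_no < 1001 OR emp_no IS NULL for the first partition, emp_no >= 1001 AND emp_no < 2001 for the second, and emp_no >= 99001 for the last. This is also why rows outside [lowerBound, upperBound] are not filtered out: they simply land in the first or last partition.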
Also, numPartitions is the number of parallel connections you are asking the RDBMS to serve while reading the data. By providing numPartitions you cap the number of connections, so you avoid exhausting the connection pool on the RDBMS side.
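Coming back to the asker's skew scenario (almost all IDs under 100,000 or over 1,900,000): evenly strided ranges would indeed leave the middle executors mostly idle. One way around that, offered here as a hedged sketch rather than something from the answer above, is the predicates overload of spark.read.jdbc, where you supply one hand-written WHERE clause per partition and can size the ranges to the real data distribution. The table name, column name, and boundaries below are hypothetical; jdbcUrl and connectionProperties are assumed to be defined as before.

// Each predicate string becomes the WHERE clause of exactly one partition
// (and one JDBC connection), so dense ID regions can be split finely and
// the sparse middle covered by a single wide slice.
val predicates = Array(
  "id <= 50000",                      // dense "old" region
  "id > 50000 AND id <= 100000",
  "id > 100000 AND id <= 1900000",    // sparse middle, one wide slice
  "id > 1900000 AND id <= 1950000",   // dense "new" region
  "id > 1950000"
)
val skewAwareDf = spark.read.jdbc(jdbcUrl, "my_table", predicates, connectionProperties)
// skewAwareDf.rdd.getNumPartitions == predicates.length

Since each predicate maps to exactly one partition, the JDBC connection count stays bounded at predicates.length, just as it does with numPartitions.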