What is the meaning of the partitionColumn, lowerBound, upperBound, and numPartitions parameters?
Question
While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters such as partitionColumn, lowerBound, upperBound, and numPartitions. I have gone through the Spark documentation but was not able to understand it.
Can anyone explain the meaning of these parameters to me?
Recommended answer
It is simple:
- partitionColumn is the column used to determine the partitions.
- lowerBound and upperBound determine the partition stride. They are not used to filter rows: all rows of the table are read, and values outside the range fall into the first or last partition. The complete dataset therefore corresponds to the following query:
SELECT * FROM table
- numPartitions determines the number of partitions to be created. The range between lowerBound and upperBound is divided into numPartitions parts, each with a stride equal to:
upperBound / numPartitions - lowerBound / numPartitions
For example, if:
- lowerBound: 0
- upperBound: 1000
- numPartitions: 10
the stride is equal to 100 and the partitions correspond to the following queries:
- SELECT * FROM table WHERE partitionColumn < 100 OR partitionColumn IS NULL
- SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
- ...
- SELECT * FROM table WHERE partitionColumn >= 900
Each range is half-open, so no row is read twice, and the first and last partitions are open-ended, so rows below 0, above 1000, or with a NULL partitionColumn are still fetched.
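To make the partitioning concrete, here is a minimal sketch in plain Python of how per-partition predicates can be derived from the stride formula above. It approximates what Spark's JDBC source does and is not Spark's actual code. The function name partition_predicates is hypothetical, the sketch assumes non-negative integer bounds, and exact rounding may differ between Spark versions.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE predicate per partition (hypothetical helper).

    Assumes non-negative integer bounds. The first and last partitions
    are open-ended, so the bounds only shape the stride and never
    filter rows out.
    """
    if num_partitions <= 1 or lower_bound == upper_bound:
        return [None]  # a single partition needs no predicate

    # Divide each term separately, as in the stride formula above.
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # No lower limit on the first partition.
        lower_clause = f"{column} >= {current}" if i != 0 else None
        current += stride
        # No upper limit on the last partition.
        upper_clause = f"{column} < {current}" if i != num_partitions - 1 else None

        if upper_clause is None:
            predicates.append(lower_clause)
        elif lower_clause is None:
            # NULL values are assigned to the first partition.
            predicates.append(f"{upper_clause} OR {column} IS NULL")
        else:
            predicates.append(f"{lower_clause} AND {upper_clause}")
    return predicates


if __name__ == "__main__":
    for predicate in partition_predicates("partitionColumn", 0, 1000, 10):
        print(f"SELECT * FROM table WHERE {predicate}")
```

Because the outer partitions are unbounded, bounds that cover only a small part of the column's real value range tend to produce skewed partitions, with most rows landing in the first or last one.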