Partitioning in spark while reading from RDBMS via JDBC
Question
I am running spark in cluster mode and reading data from RDBMS via JDBC.
As per Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple workers:
- partitionColumn
- lowerBound
- upperBound
- numPartitions
These are optional parameters.

What would happen if I don't specify these:

- Would only 1 worker read the whole data?
- If the data is still read in parallel, how does it get partitioned?
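For context, here is a minimal sketch of how these parameters are typically passed through DataFrameReader.jdbc; the connection URL, credentials, table name, and partitioning column below are placeholders, not details from my actual setup:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

// Placeholder credentials for the example.
val connProps = new Properties()
connProps.setProperty("user", "dbuser")
connProps.setProperty("password", "dbpassword")

// numPartitions tasks each read one stride of [lowerBound, upperBound) over
// partitionColumn; the bounds only set the stride size, they do not filter rows.
val df = spark.read.jdbc(
  "jdbc:postgresql://dbhost:5432/mydb", // url (placeholder)
  "orders",                             // table (placeholder)
  "order_id",                           // partitionColumn (numeric, date or timestamp)
  1L,                                   // lowerBound
  1000000L,                             // upperBound
  8,                                    // numPartitions
  connProps
)
```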
Answer
If you don't specify either {partitionColumn, lowerBound, upperBound, numPartitions} or {predicates}, Spark will use a single executor and create a single non-empty partition. All data will be processed using a single transaction, and reads will be neither distributed nor parallelized.
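To illustrate, a minimal sketch (with placeholder connection details, like those in the question): reading without any partitioning options produces a single partition, while the predicates overload produces one partition per WHERE-clause fragment.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-single-vs-parallel").getOrCreate()

val connProps = new Properties()
connProps.setProperty("user", "dbuser")        // placeholder credentials
connProps.setProperty("password", "dbpassword")

val url = "jdbc:postgresql://dbhost:5432/mydb" // placeholder URL

// No partitionColumn/lowerBound/upperBound/numPartitions and no predicates:
// the whole table is read by one task over one JDBC connection.
val single = spark.read.jdbc(url, "orders", connProps)
println(single.rdd.getNumPartitions) // 1

// Alternative: the predicates overload creates one partition per WHERE fragment,
// so the two ranges below are fetched by two tasks in parallel.
val predicates = Array(
  "order_id <  500000",
  "order_id >= 500000"
)
val parallel = spark.read.jdbc(url, "orders", predicates, connProps)
println(parallel.rdd.getNumPartitions) // 2
```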
See also:
- How to optimize partitioning when migrating data from JDBC source?
- How to improve performance for slow Spark jobs using DataFrame and JDBC connection?