Coalesce reducing JDBC read parallelism
Problem description
I'm using Spark's JDBC capabilities as follows:

- Read MySQL tables into a DataFrame
- Transform them
- Coalesce them
- Write them to HDFS
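The steps above can be sketched as follows. This is a minimal sketch, not the asker's actual code: the connection URL, credentials, table name, partition column, bounds, and output path are all placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mysql-to-hdfs").getOrCreate()
import spark.implicits._

// Parallel JDBC read: Spark splits `id` into 42 ranges and fires
// up to 42 concurrent queries against MySQL.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/mydb")
  .option("dbtable", "some_table")
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")   // numeric column to split the read on
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "42")
  .load()

// Stand-in for the real transformations.
val transformed = df.filter($"id" > 0)

// Shrink to 6 partitions so the write produces 6 files on HDFS.
transformed.coalesce(6)
  .write
  .parquet("hdfs:///path/to/output")
```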
Throughout the lifespan of the DataFrame, no actions are performed on it. This used to work as expected, but lately I've been experiencing problems: thanks to Spark's lazy evaluation, the coalesce is resulting in reduced parallelism of the read operation.
So if I read the DataFrame using DataFrameReader.jdbc(..numPartitions..) with numPartitions=42, and then coalesce it to 6 partitions before writing, it reads the DataFrame with a concurrency of only 6 (firing only 6 queries to MySQL). I'd like it to read with a parallelism of 42, as it did before, and perform the coalesce afterwards.
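The effect can be observed directly from the partition counts (a sketch; `df` stands for the DataFrame returned by a JDBC read with numPartitions = 42):

```scala
// df: the DataFrame returned by the JDBC read with numPartitions = 42
df.rdd.getNumPartitions
// => 42

// coalesce(6) does not insert a shuffle, so Spark folds it into the
// same stage as the JDBC scan: the whole stage runs with only 6 tasks,
// meaning only 6 queries are fired at MySQL.
df.coalesce(6).rdd.getNumPartitions
// => 6
```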
I've recently migrated to Spark 2.3.0 on EMR 5.13. Could this be related to that? Is there a workaround?
Recommended answer
"Thanks to Spark's lazy evaluation, the coalesce is resulting in reduced parallelism of the read operation."
It has nothing to do with laziness. coalesce intentionally doesn't create an analysis barrier. Quoting the Spark documentation:
"However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is)."
So just follow the documentation and use repartition instead of coalesce.
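In the scenario from the question (read with numPartitions = 42, write with 6 partitions) the fix is a one-line change; `transformed` and the output path below are placeholders:

```scala
// repartition(6) inserts a shuffle, which acts as a stage boundary:
// the JDBC read stage keeps its 42 tasks (42 concurrent MySQL queries),
// and only the post-shuffle write stage runs with 6 tasks.
transformed.repartition(6)
  .write
  .parquet("hdfs:///path/to/output")
```

The trade-off is the cost of shuffling the data once, in exchange for keeping the full read parallelism upstream.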