SparkSQL PostgreSQL DataFrame partitions
Question
I have a very simple setup of SparkSQL connecting to a Postgres DB, and I'm trying to load a table into a DataFrame with a given number X of partitions (let's say 2). The code is the following:
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
options.put("driver", POSTGRES_DRIVER);
// "dbtable" takes a table name or a parenthesized subquery with an alias
options.put("dbtable", "(select ID, OTHER from TABLE limit 1000) as t");
options.put("partitionColumn", "ID");
options.put("lowerBound", "100");
options.put("upperBound", "500");
options.put("numPartitions","2");
DataFrame housingDataFrame = sqlContext.read().format("jdbc").options(options).load();
For some reason, one partition of the DataFrame contains almost all rows.
As far as I can tell, lowerBound/upperBound are the parameters used to fine-tune this. The SparkSQL documentation (Spark 1.4.0 - spark-sql_2.11) says they are used to define the stride, not to filter/range the partition column. But that raises several questions:
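- Is the stride the frequency (the number of elements returned by each query) with which Spark queries the DB for each executor (partition)?
- If not, what is the purpose of these parameters, what do they depend on, and how can I balance my DataFrame partitions in a stable way? (I'm not asking for all partitions to contain the same number of elements, just for some equilibrium: for example, with 2 partitions and 100 elements, 55/45, 60/40 or even 65/35 would do.)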
I can't seem to find a clear answer to these questions anywhere and was wondering if some of you could clear up these points for me, because right now this is hurting my cluster performance: when processing X million rows, all the heavy lifting goes to a single executor.
Cheers and thanks for your time.
Recommended answer
The lower and upper bounds are indeed used against the partitioning column; refer to this code (the current version at the time of writing):
The function columnPartition contains the partitioning logic and shows how the lower / upper bounds are used.
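To make the role of the bounds concrete, here is a minimal sketch of how a stride and the per-partition WHERE predicates can be derived from lowerBound, upperBound and numPartitions. It is a simplified illustration modeled on the partitioning logic described above, not Spark's actual source; the class JdbcPartitionSketch and the method whereClauses are hypothetical names, and details may differ between Spark versions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Spark.
class JdbcPartitionSketch {

    // Returns one WHERE predicate per partition.
    // A null entry means "no predicate": a single partition reads everything.
    static List<String> whereClauses(String column, long lowerBound,
                                     long upperBound, int numPartitions) {
        List<String> clauses = new ArrayList<>();
        if (numPartitions <= 1) {
            clauses.add(null);
            return clauses;
        }
        // Divide before subtracting to avoid overflow, at the cost of some round-off.
        long stride = upperBound / numPartitions - lowerBound / numPartitions;
        long current = lowerBound;
        for (int i = 0; i < numPartitions; i++) {
            // The first partition has no lower limit, the last one no upper limit,
            // so rows outside [lowerBound, upperBound] are still read.
            String lower = (i != 0) ? column + " >= " + current : null;
            current += stride;
            String upper = (i != numPartitions - 1) ? column + " < " + current : null;
            if (upper == null) {
                clauses.add(lower);
            } else if (lower == null) {
                clauses.add(upper);
            } else {
                clauses.add(lower + " AND " + upper);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Values from the question: lowerBound=100, upperBound=500, numPartitions=2.
        for (String clause : whereClauses("ID", 100, 500, 2)) {
            System.out.println(clause);
        }
    }
}
```

With the values from the question, the stride is 500/2 - 100/2 = 200, so the two predicates are ID < 300 and ID >= 300. Under this logic the bounds only decide where the split points fall; they do not filter rows. If most of the selected IDs lie on one side of 300, that partition receives almost all rows. Choosing bounds close to the actual minimum and maximum of the partition column should give a more even split when the values are roughly uniformly distributed.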