How to avoid gc overhead limit exceeded in a range query with GeoSpark?


Problem description

I am using Spark 2.4.3 with the GeoSpark 1.2.0 extension.

I have two tables to join with a range-distance condition. One table (t1) is ~100K rows with a single column, a GeoSpark geometry. The other table (t2) is ~30M rows, composed of an Int value and a GeoSpark geometry column.

What I am trying to do is just a simple query:

    val spark = SparkSession
      .builder()
//      .master("local[*]")
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
      .config("geospark.global.index", "true")
      .config("geospark.global.indextype", "rtree")
      .config("geospark.join.gridtype", "rtree")
      .config("geospark.join.numpartition", 200)
      .config("spark.sql.parquet.filterPushdown", "true")
//      .config("spark.sql.shuffle.partitions", 10000)
      .config("spark.sql.autoBroadcastJoinThreshold", -1)
      .appName("PropertyMaster.foodDistanceEatout")
      .getOrCreate()

GeoSparkSQLRegistrator.registerAll(spark)

spark.sparkContext.setLogLevel("ERROR")

spark.read
  .load(s"$dataPath/t2")
  .repartition(200)
  .createOrReplaceTempView("t2")

spark.read
  .load(s"$dataPath/t1")
  .repartition(200)
  .cache()
  .createOrReplaceTempView("t1")

val query =
  """
    |select /*+ BROADCAST(t1) */
    |  t2.cid, ST_Distance(t1.geom, t2.geom) as distance
    |  from t2, t1 where ST_Distance(t1.geom, t2.geom) <= 3218.69""".stripMargin

spark.sql(query)
  .repartition(200)
  .write.mode(SaveMode.Append)
  .option("path", s"$dataPath/my_output.csv")
  .format("csv").save()

I tried different configurations, both when running it locally and on the local cluster on my laptop (16 GB total memory and 8 cores), but without any luck: the program crashes at GeoSpark's "Distinct at Join" stage with lots of shuffling. However, I am not able to remove the shuffling from the SparkSQL syntax. I also thought of adding an extra id column to the biggest table, for example the same integer for every ~200 rows, and repartitioning by that (sketched below), but it did not work either.
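
A minimal sketch of that bucketing idea, purely illustrative (the bucket column name and the use of monotonically_increasing_id are mine, not from the original code):

    import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

    // assign the same bucket id to roughly every 200 consecutive rows, then
    // repartition on it; note the generated ids are not contiguous across
    // partitions, so buckets are only approximately 200 rows each
    val t2Bucketed = spark.table("t2")
      .withColumn("bucket", (monotonically_increasing_id() / 200).cast("int"))
      .repartition(col("bucket"))
    t2Bucketed.createOrReplaceTempView("t2")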

I was expecting a partitioner from GeoSpark's indexing, but I am not sure it is actually being used.
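
For reference, the explicit way to drive that spatial partitioning would be GeoSpark's core RDD API. The sketch below follows the documented distance-join recipe, with the grid and index types mirroring the settings above; I have not verified the exact 1.2.0 signatures, so treat it as an outline rather than working code:

    import org.datasyslab.geospark.enums.{GridType, IndexType}
    import org.datasyslab.geospark.spatialOperator.JoinQuery
    import org.datasyslab.geospark.spatialRDD.CircleRDD
    import org.datasyslab.geosparksql.utils.Adapter

    // turn both temp views into SpatialRDDs; non-geometry columns (t2.cid)
    // travel along as the geometry's userData
    val t1Rdd = Adapter.toSpatialRdd(spark.table("t1"), "geom")
    val t2Rdd = Adapter.toSpatialRdd(spark.table("t2"), "geom")
    t1Rdd.analyze()
    t2Rdd.analyze()

    // a distance join is expressed as a circle-vs-geometry join
    val circles = new CircleRDD(t2Rdd, 3218.69)

    // partition both sides with the same spatial grid, build an R-tree on the
    // partitioned data, then join with the index (useIndex = true,
    // considerBoundaryIntersection = true, i.e. <= distance)
    circles.spatialPartitioning(GridType.RTREE)
    t1Rdd.spatialPartitioning(circles.getPartitioner)
    t1Rdd.buildIndex(IndexType.RTREE, true)
    val pairs = JoinQuery.DistanceJoinQueryFlat(t1Rdd, circles, true, true)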

Any idea?

Answer

I found the answer myself. The GC overhead problem was caused by the partitioning, but also by the memory needed by GeoSpark's (index-based) partitioner and by timeouts due to the long geo-query computation. It was solved by adding the following parameters, as suggested on the GeoSpark website itself:

spark.executor.memory 4g
spark.driver.memory 10g
spark.network.timeout 10000s
spark.driver.maxResultSize 5g
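
For completeness, one way to apply those values in the same driver code as in the question is sketched below. Note that spark.driver.memory is read when the driver JVM starts, so in practice it belongs in spark-submit (--driver-memory 10g) or spark-defaults.conf rather than in the builder:

    val spark = SparkSession
      .builder()
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
      .config("spark.executor.memory", "4g")
      .config("spark.network.timeout", "10000s")
      .config("spark.driver.maxResultSize", "5g")
      // spark.driver.memory must be set before the JVM starts:
      // pass it via spark-submit --driver-memory 10g instead
      .appName("PropertyMaster.foodDistanceEatout")
      .getOrCreate()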

