Spark JDBC fetchsize option


Question

I currently have an application that is supposed to connect to different types of databases, run a specific query on each database using Spark's JDBC options, and then write the resulting DataFrame to HDFS.

Performance was extremely bad for Oracle (I didn't check the other databases). It turned out to be caused by the fetchSize property, which defaults to 10 rows for Oracle. I increased it to 1000 and the performance gain was clearly visible. I then changed it to 10000, but some of the tables started failing with out-of-memory errors in the executors (6 executors with 4G memory each, 2G driver memory).

My questions are:

  • Is the data fetched through Spark's JDBC source persisted in executor memory on every run? Is there any way to un-persist it while the job is running?

  • Where can I get more information about the fetchSize property? I'm guessing it won't be supported by all JDBC drivers.

  • Is there anything else related to JDBC that I need to take care of to avoid OOM errors?

Recommended answer

Fetch size is just a value set on the JDBC PreparedStatement.

You can see it in JDBCRDD.scala:

 stmt.setFetchSize(options.fetchSize)

You can read more about JDBC FetchSize here.
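On the Spark side, this value is normally supplied through the `fetchsize` option of the JDBC data source. The following is a minimal sketch, not the asker's actual code: the object name, the method name and all connection arguments are hypothetical placeholders, and it assumes a Spark SQL dependency and a suitable JDBC driver on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration; names and arguments are assumptions.
object JdbcFetchSizeExample {

  // Reads one table over JDBC and passes `fetchSize` down to the driver.
  // Spark forwards the value to stmt.setFetchSize, as shown above.
  def readWithFetchSize(
      spark: SparkSession,
      url: String,      // e.g. a JDBC URL for your database (placeholder)
      table: String,    // table name or a parenthesized subquery with an alias
      user: String,
      password: String,
      fetchSize: Int): DataFrame = {
    require(fetchSize >= 0, "fetchSize must not be negative")
    spark.read
      .format("jdbc")
      .option("url", url)
      .option("dbtable", table)
      .option("user", user)
      .option("password", password)
      .option("fetchsize", fetchSize.toString) // rows fetched per round trip
      .load()
  }
}
```

A value such as 1000 is a reasonable starting point given the numbers in the question. Larger values cut round trips but increase the number of rows the driver buffers for each open result set.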

Another thing you can improve is to set all four partitioning parameters (partitionColumn, lowerBound, upperBound and numPartitions), which parallelizes the read. See more here. The read can then be split across many machines, so the memory usage on each of them may be smaller.
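A partitioned read could look roughly like the sketch below. The object name, the method name and the arguments are hypothetical; it assumes the table has a numeric, date or timestamp column that is reasonably evenly distributed between the two bounds.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration; names and arguments are assumptions.
object JdbcPartitionedReadExample {

  // Splits the read into `numPartitions` range queries on `partitionColumn`.
  // lowerBound and upperBound only determine the partition stride; they do
  // not filter rows, so values outside the range still land in the first or
  // last partition.
  def readPartitioned(
      spark: SparkSession,
      url: String,
      table: String,
      partitionColumn: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      fetchSize: Int): DataFrame = {
    require(numPartitions > 0, "numPartitions must be positive")
    require(lowerBound < upperBound, "lowerBound must be below upperBound")
    spark.read
      .format("jdbc")
      .option("url", url)
      .option("dbtable", table)
      .option("partitionColumn", partitionColumn)
      .option("lowerBound", lowerBound.toString)
      .option("upperBound", upperBound.toString)
      .option("numPartitions", numPartitions.toString)
      .option("fetchsize", fetchSize.toString) // applies to each partition's query
      .load()
  }
}
```

Each partition opens its own connection and runs its own query, so numPartitions also caps the number of concurrent connections to the database. Choosing it together with the fetch size keeps the rows buffered per task bounded.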

For details on which JDBC options are supported and how, consult your driver's documentation; every driver may have its own behaviour.
