Spark JDBC fetchsize option


Question

I currently have an application that is supposed to connect to different types of databases, run a specific query against each database using Spark's JDBC options, and then write the resulting DataFrame to HDFS.

The performance was extremely bad for Oracle (I didn't check all of them). It turned out to be because of the fetchSize property, which defaults to 10 rows for Oracle. So I increased it to 1000 and the performance gain was quite visible. Then I changed it to 10000, but some of the tables started failing with an out-of-memory error in the executors (6 executors, 4 GB memory each, 2 GB driver memory).
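For reference, fetchsize is passed as an ordinary option on the JDBC reader. A minimal sketch in Spark Scala follows; the connection URL, credentials, and table name are placeholders, not values from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-fetchsize").getOrCreate()

// Hypothetical Oracle connection details -- substitute your own.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("fetchsize", "1000") // rows fetched per round trip; Oracle's driver default is 10
  .load()
```

fetchsize only controls how many rows the driver pulls per network round trip, not how many rows end up in a partition, so larger values trade executor memory for fewer round trips.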

My questions are:

  • Is the data fetched by Spark's JDBC persisted in executor memory for the whole run? Is there any way to unpersist it while the job is running?

Where can I get more information about the fetchSize property? I'm guessing it won't be supported by all JDBC drivers.

Are there any other JDBC-related things I need to take care of to avoid OOM errors?

Answer

Fetch size is just a value set on the JDBC PreparedStatement.

You can see it in JDBCRDD.scala:

 stmt.setFetchSize(options.fetchSize)

You can read more about JDBC FetchSize here

One thing you can also improve is to set all four partitioning parameters (partitionColumn, lowerBound, upperBound, numPartitions), which will parallelize the read. See more here. Your read can then be split across many machines, so the memory usage on each of them may be smaller.
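A sketch of a partitioned read with all four parameters set; the column name and bounds below are hypothetical and must refer to a numeric (or date/timestamp) column in your own table:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

// Hypothetical connection and table; ID is assumed to be a numeric column.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "ID")  // column Spark splits the query on
  .option("lowerBound", "1")        // lowest expected ID value
  .option("upperBound", "1000000")  // highest expected ID value
  .option("numPartitions", "12")    // number of parallel JDBC connections
  .option("fetchsize", "1000")
  .load()
```

Note that lowerBound and upperBound only control the stride of the generated WHERE clauses; rows outside those bounds are still read, just all by the first and last partitions.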

For details on which JDBC options are supported and how, you must search your driver's documentation; every driver may have its own behaviour.
