在Oracle中使用Spark时的数据加载时间 [英] data load time when using spark with oracle

查看:201
本文介绍了在Oracle中使用Spark时的数据加载时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从oracle加载数据以在juypter笔记本中启动火花.但是每次尝试绘制图形时,花费的时间都是巨大的.如何使其更快?

I am trying to load data from oracle to spark in juypter notebook.But each time I try to pot graph the time taken is huge. How do I make it faster?

query = "(select * from db.schema where lqtime between trunc(sysdate)-30 and trunc(sysdate) )"
%time df = sqlContext.read.format('jdbc').options(url="jdbc:oracle:thin:useradmin/pass12@//localhost:1521/aldb",dbtable=query,driver="oracle.jdbc.OracleDriver").load()

现在我按节点分组:

%time fo_node = df.select('NODE').groupBy('NODE').count().sort('count',ascending=False)
%time fo_node.show(10)

每次运行此命令,加载时间为4m或更长.

The load time is 4m or more each time I run this.

推荐答案

针对关系数据库使用hadoop或apache spark是一种反模式,因为数据库一次接收到太多连接,并尝试一次响应太多请求.磁盘肯定不堪重负-即使有了索引和分区,它也会一次读取每个分区的数据.我敢打赌,那里有HDD,这就是为什么它真的很慢的原因.

Using hadoop or apache spark against relational database is an anti-pattern, as database receives too many connections at once and tries to respond to too many requests at once. The disk most certainly is overwhelmed - even with index and partitioning, it reads data for each partition at once. I bet you have HDD there and I'd say that's the reason why it's really slow.

要加快加载速度,您可以尝试

To speed loading up you can try

  1. 实际上减少了要加载的分区数.稍后,您可以对其进行改组并提高并行度.

  1. to actually reduce number of partitions for loading. Later on you can reshuffle it and increase parallelism.

以选择特定字段而不是*.即使您只需要一个,也可以有所作为.

to select specific fields instead of *. Even if you need all, but one, it can make a difference.

如果您在数据框上运行少量操作,则可以对其进行缓存.如果群集上没有足够的内存,则也可以使用选项在磁盘上进行缓存.

if you run few actions on the Dataframe, it make sense to cache that. If you don't have enough memory on cluster, then use options to cache on disk, too.

可将所有内容从Oracle导出到磁盘,并将其读取为纯csv文件,然后像在集群查询中一样处理它.

to export everything from Oracle to disk and read it as a plain csv file, processing it as you do right now in a query on a cluster.

这篇关于在Oracle中使用Spark时的数据加载时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆