Does Apache Spark load the entire data set from the target database?

Question

I want to use Apache Spark and connect to Vertica over JDBC.

The Vertica database holds 100 million records, and the Spark code runs on another server.

When I run the query in Spark and monitor network usage, traffic between the two servers is very high.

It seems that Spark loads all the data from the target server.

Here is my code:

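# `spark` is an existing SparkSession and `url` is the Vertica JDBC URL.
# The chained call is wrapped in parentheses so it can span several lines.
test_df = (
    spark.read.format("jdbc")
    .option("url", url).option("dbtable", "my_table")
    .option("user", "user").option("password", "pass")
    .load()
)

# Register the JDBC-backed DataFrame as a temporary view for SQL access.
test_df.createOrReplaceTempView("tb")

data = spark.sql("select * from tb")

data.show()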

When I run this, the result comes back after 2 minutes of very high network usage.

Does Spark load the entire data set from the target database?

Recommended answer

After your Spark job finishes, log on to the Vertica database with the same credentials that the Spark job used and run:

SELECT * FROM v_monitor.query_requests ORDER BY start_timestamp DESC LIMIT 10000;

This shows you the queries that the Spark job sent to the database, so you can see whether it pushed the count(*) down to the database or really did try to retrieve the entire table across the network.
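If the query log shows that the whole table is being fetched, one thing you could try is handing Spark a subquery instead of a bare table name, so that Vertica does the column selection, filtering, and limiting before anything crosses the network. The following is only a sketch under assumptions: the helper `pushdown_options`, the column names `id` and `name`, the predicate, and the JDBC URL are all made up for illustration and are not from the question. It relies on the Spark JDBC source accepting a parenthesised subquery with an alias in the `dbtable` option.

```python
def pushdown_options(url, table, columns, predicate=None, limit=None):
    """Build JDBC reader options whose dbtable is a subquery.

    Hypothetical helper: the database evaluates the SELECT list, the
    WHERE clause, and the LIMIT, so only the resulting rows are sent to Spark.
    """
    query = "SELECT {} FROM {}".format(", ".join(columns), table)
    if predicate:
        query += " WHERE {}".format(predicate)
    if limit is not None:
        query += " LIMIT {}".format(int(limit))
    return {
        "url": url,
        # A parenthesised subquery with an alias stands in for a table name.
        "dbtable": "({}) AS t".format(query),
        "user": "user",
        "password": "pass",
    }


opts = pushdown_options(
    "jdbc:vertica://host:5433/db",  # placeholder URL, replace with your own
    "my_table",
    ["id", "name"],                 # assumed column names
    predicate="id < 1000",          # assumed filter
    limit=20,
)
print(opts["dbtable"])

# With a live SparkSession named `spark` (not created in this sketch):
# test_df = spark.read.format("jdbc").options(**opts).load()
# test_df.explain()  # inspect the physical plan of the JDBC scan
# test_df.show()
```

After running a job with these options, the same query against v_monitor.query_requests should show the subquery text arriving at Vertica, which is an easy way to confirm what was actually pushed down.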
