Does Apache Spark load the entire data set from the target database?
Question
I want to use Apache Spark and connect to Vertica over JDBC.
The Vertica database holds 100 million records, and the Spark code runs on a different server.
When I run the query in Spark and monitor network usage, traffic between the two servers is very high.
It seems Spark loads all the data from the target server.
Here is my code:
# url is the Vertica JDBC connection string, defined elsewhere
test_df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "pass")
    .load()
)
test_df.createOrReplaceTempView("tb")
data = spark.sql("select * from tb")
data.show()
When I run this, the result comes back after about 2 minutes of very high network usage.
Does Spark load the entire data set from the target database?
Recommended answer
After your Spark job finishes, log on to the Vertica database with the same credentials the Spark job used and run:
SELECT * FROM v_monitor.query_requests ORDER BY start_timestamp DESC LIMIT 10000;
This shows the queries that the Spark job sent to the database, so you can see whether it pushed the work (for example a count(*)) down to the database or actually tried to retrieve the entire table across the network.
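If the query log does show an unfiltered read of the whole table, one possible remedy is to hand Spark an explicit SQL statement through the JDBC reader's `query` option, so that the limiting or filtering runs inside the database. The following is only a sketch under assumptions: the helper names `jdbc_query_options` and `read_limited`, the placeholder URL, and the `LIMIT 20` statement are illustrative and not part of the original question, and whether this reduces traffic in your setup should be confirmed in `v_monitor.query_requests`.

```python
def jdbc_query_options(url, user, password, sql):
    """Return JDBC reader options that send `sql` to the database.

    Uses Spark's `query` option in place of `dbtable`;
    Spark does not allow both to be set on the same read.
    """
    return {
        "url": url,
        "user": user,
        "password": password,
        "query": sql,
    }


def read_limited(spark, url, user, password):
    """Read a small sample; the LIMIT is part of the SQL sent to the database."""
    opts = jdbc_query_options(
        url, user, password, "SELECT * FROM my_table LIMIT 20"
    )
    # `spark` is an existing SparkSession (assumed, as in the question)
    return spark.read.format("jdbc").options(**opts).load()


# Placeholder connection values, for illustration only
opts = jdbc_query_options(
    "jdbc:vertica://example-host:5433/db",
    "user",
    "pass",
    "SELECT * FROM my_table LIMIT 20",
)
print(sorted(opts))
```

Calling `read_limited(spark, url, "user", "pass").show()` would then request only the rows the statement selects. On the Spark side, `DataFrame.explain()` prints the physical plan, which is another place to check what was pushed to the JDBC source.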