Writing SQL vs using Dataframe APIs in Spark SQL


Problem Description

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which ingests data into the stage, raw, and application layers in HDFS and performs CDC (change data capture); it is currently written as Hive queries and executed via Oozie. This needs to be migrated into a Spark application (current version 1.6). The other sections of the code will be migrated later.

In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute my queries as-is (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame API and rewrite the HQL that way.
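For concreteness, here is a minimal sketch of the two approaches side by side using Spark 1.6 APIs; the table and column names (orders, customer_id, amount) are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.functions.sum

    val sc = new SparkContext(new SparkConf().setAppName("hql-vs-dataframe"))
    val sqlContext = new HiveContext(sc)  // HiveContext gives access to Hive tables

    // Approach 1: run the existing HQL as-is
    val bySql = sqlContext.sql(
      "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")

    // Approach 2: the same logic expressed with the DataFrame API
    val byApi = sqlContext.table("orders")
      .groupBy("customer_id")
      .agg(sum("amount").as("total"))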

What is the difference between these two approaches?

Is there any performance gain from using the DataFrame API?

Some people have suggested that there is an extra SQL layer the Spark core engine has to go through when running "SQL" queries directly, which may impact performance to some extent, but I haven't found any material substantiating that claim. I know the code would be much more compact with the DataFrame API, but when I already have all my HQL queries handy, is it really worth rewriting everything against the DataFrame API?
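One way to sanity-check the "extra layer" claim is to compare the physical plans: both forms go through the same Catalyst optimizer, so explain() typically prints the same plan. A minimal sketch, reusing the sqlContext and the illustrative orders table from above:

    // Same query in both forms; Catalyst should produce the same physical plan
    sqlContext.sql(
      "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id").explain()

    sqlContext.table("orders")
      .groupBy("customer_id")
      .agg(sum("amount").as("total"))
      .explain()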

Thanks.

Recommended Answer

Question: What is the difference between these two approaches? Is there any performance gain from using the DataFrame API?


Answer:

There is a comparative study done by Hortonworks. source...

The gist: depending on the situation/scenario, each one is the right choice; there is no hard and fast rule to decide this. Please go through the points below.

RDDs, DataFrames, and SparkSQL (in fact 3 approaches, not just 2):

At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDDs (a short sketch follows the list below):

  • Resilient: if data in memory is lost, it can be recreated
  • Distributed: an immutable, distributed collection of objects in memory, partitioned across many data nodes in a cluster
  • Dataset: the initial data can come from a file, be created programmatically, come from data in memory, or come from another RDD
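A minimal sketch of the three origins listed above (paths and values are illustrative; sc is the SparkContext from the earlier sketch):

    val fromFile    = sc.textFile("hdfs:///data/raw/orders.csv")  // from a file
    val inMemory    = sc.parallelize(Seq(1, 2, 3, 4))             // created programmatically from in-memory data
    val fromAnother = inMemory.map(_ * 2)                         // derived from another RDD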

The DataFrame API is a data abstraction framework that organizes your data into named columns (a construction sketch follows the list):

  • Creates a schema for the data
  • Conceptually equivalent to a table in a relational database
  • Can be constructed from many sources, including structured data files, tables in Hive, external databases, or existing RDDs
  • Provides a relational view of the data for easy SQL-like operations such as data manipulation and aggregation
  • Under the hood, it is an RDD of Rows
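A short sketch of constructing DataFrames from several of the sources above, and of the RDD of Rows underneath (Spark 1.6 APIs; paths and names are illustrative):

    val fromHive = sqlContext.table("orders")                            // a table in Hive
    val fromJson = sqlContext.read.json("hdfs:///data/raw/orders.json")  // a structured data file
    val fromRdd  = sqlContext.createDataFrame(
      sc.parallelize(Seq(("p1", 10), ("p2", 20)))).toDF("product", "amount")  // an existing RDD

    // Under the hood, a DataFrame is an RDD of Row
    val rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = fromHive.rdd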

SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:

  • SQL
  • DataFrame API
  • Dataset API (a brief sketch follows this list)
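The SQL and DataFrame forms are sketched earlier; for completeness, here is a sketch of the Dataset API, which was still experimental in Spark 1.6 (the Order case class and its fields are assumptions matching the illustrative orders table):

    import sqlContext.implicits._  // provides encoders for case classes

    case class Order(customer_id: String, amount: Double)

    val ds  = sqlContext.table("orders").as[Order]  // typed Dataset
    val big = ds.filter(_.amount > 100.0)           // compile-time-checked lambda instead of a SQL string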

Findings from the study:

  • RDDs outperformed DataFrames and SparkSQL for certain types of data processing
  • DataFrames and SparkSQL performed almost the same, although in analyses involving aggregation and sorting SparkSQL had a slight advantage

Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDDs.

Test methodology from the study:

  • Took the best out of 3 runs for each test
  • Times were consistent, with not much variation between tests
  • Jobs were run individually, with no other jobs running

The benchmark consisted of a random lookup against 1 order ID from 9 million unique order IDs, and grouping all the different products with their total counts, sorted descending by product name.
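For reference, the two benchmark operations could look roughly like this in the DataFrame API (a sketch; order_id and product_name are assumed column names):

    import org.apache.spark.sql.functions.col

    // 1) Random lookup of a single order ID among 9 million unique IDs
    val lookup = sqlContext.table("orders").where(col("order_id") === 4242)

    // 2) Group all the different products with their total counts,
    //    sorted descending by product name
    val counts = sqlContext.table("orders")
      .groupBy("product_name")
      .count()
      .orderBy(col("product_name").desc)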

