Which is more efficient: DataFrame, RDD, or HiveQL?

Views: 112  Published: 2016/5/22 16:14:57  Tags: apache-spark, apache-spark-sql


Problem description

I am new to Apache Spark.

My job reads two CSV files, selects some specific columns from them, merges them, aggregates the result, and writes it to a single CSV file.

For example, the first file has the columns:

name,age,department_id

The second file has the columns:

department_id,department_name,location

And the result should have the columns:

name,age,department_name

I load both CSV files into DataFrames and can then produce the third DataFrame using the join, select, filter, and drop methods available on DataFrame.
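For illustration, the DataFrame route described above might look roughly like the following PySpark sketch. It assumes Spark 2.0 or later (where `spark.read.csv` is built in), CSV files with a header row, and the column spelling `department_id`; the function name and all path arguments are hypothetical.

```python
def join_with_dataframes(spark, employees_path, departments_path, output_path):
    """Join two CSV files on department_id using the DataFrame API.

    `spark` is assumed to be an existing SparkSession (Spark 2.0+).
    All paths are hypothetical placeholders supplied by the caller.
    """
    # Assumption: both files start with a header row naming the columns.
    employees = spark.read.csv(employees_path, header=True, inferSchema=True)
    departments = spark.read.csv(departments_path, header=True, inferSchema=True)

    # Inner join on the shared key, then keep only the three output columns.
    result = (
        employees
        .join(departments, on="department_id", how="inner")
        .select("name", "age", "department_name")
    )

    # coalesce(1) yields a single part file inside the output directory;
    # it also funnels the write through one task, which can be slow for huge data.
    result.coalesce(1).write.csv(output_path, header=True, mode="overwrite")
    return result
```

It would be called as `join_with_dataframes(spark, "employees.csv", "departments.csv", "out")`, where `spark` is a session created elsewhere, for example with `SparkSession.builder.getOrCreate()`.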

I can also do the same using several RDD.map() calls.
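A rough sketch of what the RDD.map() route could look like in PySpark is shown here. It assumes the files have no header row, that no field needs CSV quoting on output, and that `sc` is an existing SparkContext; every function name is hypothetical.

```python
import csv
from io import StringIO


def parse_line(line):
    """Split one CSV line into a list of string fields."""
    return next(csv.reader(StringIO(line)))


def employee_pair(line):
    """'name,age,department_id' -> (department_id, (name, age))."""
    name, age, department_id = parse_line(line)
    return department_id, (name, age)


def department_pair(line):
    """'department_id,department_name,location' -> (department_id, department_name)."""
    department_id, department_name, _location = parse_line(line)
    return department_id, department_name


def to_output_line(joined):
    """(department_id, ((name, age), department_name)) -> 'name,age,department_name'."""
    _department_id, ((name, age), department_name) = joined
    # Assumption: fields contain no commas, so plain joining is enough.
    return ",".join([name, age, department_name])


def join_with_rdds(sc, employees_path, departments_path):
    """Join the two files with pair-RDD transformations.

    `sc` is assumed to be an existing SparkContext; paths are placeholders.
    """
    employees = sc.textFile(employees_path).map(employee_pair)
    departments = sc.textFile(departments_path).map(department_pair)
    # join() on pair RDDs yields (key, (left_value, right_value)).
    return employees.join(departments).map(to_output_line)
```

Here the parsing and reshaping are plain Python functions that Spark runs as-is on each record, so the column handling that the DataFrame methods do by name has to be written out by hand.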

And I can also do the same by executing HiveQL through HiveContext.
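The query-based route could be sketched as follows. This is an assumption-laden illustration: it uses a Spark 2.0+ `SparkSession` and temporary views rather than the `HiveContext` mentioned in the question, assumes header rows and the column spelling `department_id`, and the view names, query constant, and function name are all hypothetical.

```python
# Hypothetical view names `employees` and `departments`; the column names
# follow the example schemas in the question.
JOIN_QUERY = """
SELECT e.name, e.age, d.department_name
FROM employees e
JOIN departments d
  ON e.department_id = d.department_id
"""


def join_with_sql(spark, employees_path, departments_path):
    """Register both CSV files as temporary views and join them with SQL.

    `spark` is assumed to be an existing SparkSession (Spark 2.0+).
    """
    employees = spark.read.csv(employees_path, header=True, inferSchema=True)
    departments = spark.read.csv(departments_path, header=True, inferSchema=True)

    # Temporary views make the DataFrames addressable by name in the query.
    employees.createOrReplaceTempView("employees")
    departments.createOrReplaceTempView("departments")

    return spark.sql(JOIN_QUERY)
```

The returned object is again a DataFrame, so it can be written out the same way as in the DataFrame approach.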

I want to know which approach is the most efficient when my CSV files are huge, and why.
