Which is more efficient: DataFrame, RDD, or HiveQL?

Views: 29 Published: 2021/11/14 22:29:42 Tags: apache-spark apache-spark-sql spark-dataframe


Question

I am new to Apache Spark.

My job is to read two CSV files, select some specific columns from them, join them, aggregate the result, and write it to a single CSV file.

For example,

CSV1

name,age,department_id

CSV2

department_id,department_name,location

I want to produce a third CSV file:

name,age,department_name
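To make the target output concrete, here is a minimal plain-Python sketch (no Spark involved) of the join and column selection described above. The sample rows, the function names `join_people_with_departments` and `to_csv`, and the normalised column spellings `department_id` / `department_name` are assumptions made for illustration; the aggregation step is left out because the question does not say what is aggregated. It only shows the expected result on tiny inputs and says nothing about which Spark API is faster.

```python
import csv
import io

# Hypothetical sample data. Column names follow the question, with the
# spelling normalised to `department_id` / `department_name` (an assumption).
CSV1 = """name,age,department_id
Alice,30,1
Bob,25,2
Carol,41,1
"""

CSV2 = """department_id,department_name,location
1,Engineering,Berlin
2,Sales,Paris
"""


def join_people_with_departments(people_csv, departments_csv):
    """Inner-join two CSV texts on department_id and keep three columns."""
    # Build a lookup table: department_id -> department_name.
    departments = {
        row["department_id"]: row["department_name"]
        for row in csv.DictReader(io.StringIO(departments_csv))
    }
    result = []
    for row in csv.DictReader(io.StringIO(people_csv)):
        dept_name = departments.get(row["department_id"])
        if dept_name is not None:  # inner join: skip rows with no matching department
            result.append(
                {"name": row["name"], "age": row["age"], "department_name": dept_name}
            )
    return result


def to_csv(rows):
    """Serialise the joined rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["name", "age", "department_name"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(join_people_with_departments(CSV1, CSV2)))
```

With the sample data, the output has the header `name,age,department_name` followed by one row per person. The `location` column is dropped, mirroring the select/drop step in the question.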

I load both CSV files into DataFrames and can then get the third DataFrame using several of the DataFrame methods: join, select, filter, and drop.

I can also do the same using several RDD.map() calls.

And I can also do the same by running HiveQL through HiveContext.

I want to know which approach is the most efficient if my CSV files are huge, and why.
