Converting CSV to ORC with Spark

Question

I've seen this blog post by Hortonworks about ORC support in Spark 1.2 through data sources.

It covers version 1.2 and it addresses the creation of ORC files from objects, not conversion from CSV to ORC. I have also seen, as expected, ways to do these conversions in Hive.

Could someone please provide a simple example of how to load a plain CSV file in Spark 1.6+, save it as ORC, and then load it back as a DataFrame in Spark?

Answer

I'm going to omit the CSV-reading part, because that question has been answered many times before and plenty of tutorials are available on the web for that purpose; it would be overkill to write it again. Check here if you want!
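
For completeness, here is a minimal sketch of the CSV-reading step, assuming the external spark-csv package (com.databricks:spark-csv) is on the classpath and a hypothetical people.csv file with a header row:

// Read a CSV file into a DataFrame using the spark-csv package (Spark 1.6)
val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // the first line holds column names
  .option("inferSchema", "true") // infer column types instead of defaulting to strings
  .load("people.csv")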

Concerning ORC, it is supported through the HiveContext.

HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive. SQLContext provides a subset of the Spark SQL support that does not depend on Hive, but ORC, window functions, and other features depend on HiveContext, which reads its configuration from hive-site.xml on the classpath.

You can define a HiveContext as follows:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

// sc is the SparkContext of your application (created automatically in spark-shell)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

If you are working with the spark-shell, you can use sqlContext directly for this purpose without creating a hiveContext, since by default sqlContext is created as a HiveContext.
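
You can verify this from the shell; a one-line sanity check, assuming a Hive-enabled Spark build (the default in the pre-built distributions):

// prints true when sqlContext is actually a HiveContext
println(sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext])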

Specifying stored as orc at the end of the SQL statement below ensures that the Hive table is stored in ORC format. For example:

val df: DataFrame = ???           // the DataFrame loaded from the CSV file
df.registerTempTable("csv_table") // use a name different from the Hive table so it doesn't shadow it
hiveContext.sql("create table orc_table (date STRING, price FLOAT, user INT) stored as orc")
// copy the CSV-backed rows into the ORC table (assumes matching columns)
hiveContext.sql("insert into table orc_table select date, price, user from csv_table")

Saving as an ORC file

You can also persist the DataFrame directly as ORC files, without going through a Hive table:

df.write.format("orc").save("data_orc") // relative path: lands in your HDFS user directory

To store the results in the Hive warehouse directory rather than your user directory, use the path /apps/hive/warehouse/data_orc instead (the Hive warehouse path from hive-default.xml).
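
To close the loop on the question, loading the ORC data back into a DataFrame is symmetric; a minimal sketch, assuming the data_orc path written above:

// load the ORC files back as a DataFrame
val orcDf = hiveContext.read.format("orc").load("data_orc")
orcDf.show()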
