Converting CSV to ORC with Spark


Question

I've seen this blog post by Hortonworks about ORC support in Spark 1.2 through data sources.

It covers version 1.2, and it addresses the creation of ORC files from objects, not conversion from CSV to ORC. I have also seen ways (http://stackoverflow.com/questions/25117760/how-to-convert-txt-csv-file-to-orc-format) to do these conversions in Hive.

Could someone please provide a simple example of how to load a plain CSV file in Spark 1.6+, save it as ORC, and then load it back as a DataFrame in Spark?

Answer

I'm going to omit the CSV reading part, because that question has been answered many times before and plenty of tutorials are available on the web for that purpose; it would be overkill to write it again. Check here if you want!
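For completeness, here is a minimal sketch of the CSV-reading step in Spark 1.6. It assumes the databricks spark-csv package (`com.databricks:spark-csv`) is on the classpath, since CSV was not a built-in data source until Spark 2.0; the file path and options are placeholders:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc is the SparkContext (provided in spark-shell)

// Read a plain CSV file into a DataFrame via the spark-csv data source
// (a separate package in Spark 1.x; built in from Spark 2.0 onwards).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line contains column names
  .option("inferSchema", "true") // infer column types from the data
  .load("data.csv")              // placeholder path
```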

Concerning ORC, it is supported through the HiveContext.

HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive. SQLContext provides a subset of the Spark SQL support that does not depend on Hive, but ORC, window functions, and other features depend on HiveContext, which reads its configuration from hive-site.xml on the classpath.

You can define a HiveContext as follows:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

If you are working in the spark-shell, you can use sqlContext directly for this purpose without creating a hiveContext, since by default sqlContext is created as a HiveContext.

Specifying stored as orc at the end of the SQL statement below ensures that the Hive table is stored in ORC format. For example:

import org.apache.spark.sql.DataFrame

val df: DataFrame = ??? // your DataFrame, e.g. loaded from CSV

// Register the DataFrame as a temporary table; use a name distinct
// from the Hive table below to avoid the temp table shadowing it
df.registerTempTable("csv_table")

// Create a Hive table stored in the ORC format
val results = hiveContext.sql("create table orc_table (date STRING, price FLOAT, user INT) stored as orc")
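The temporary table and the ORC-backed Hive table can then be tied together with an INSERT. A minimal sketch, assuming the DataFrame was registered under the hypothetical temp-table name csv_table and that its columns match the date/price/user schema above:

```scala
// Register the DataFrame under a name distinct from the Hive table
df.registerTempTable("csv_table") // hypothetical temp-table name

// Populate the ORC-backed Hive table from the temporary table
hiveContext.sql("INSERT INTO TABLE orc_table SELECT date, price, user FROM csv_table")
```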

Saving as an ORC file

Let's persist the DataFrame into the Hive ORC table we created before.

// Write the DataFrame itself (not the empty result of the DDL statement)
df.write.format("orc").save("data_orc")

To store the results in the Hive warehouse directory rather than the user directory, use the path /apps/hive/warehouse/data_orc instead (the Hive warehouse path from hive-default.xml).
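To complete the round trip the question asks for, the ORC data can be loaded back into a DataFrame. A minimal sketch, reusing the data_orc path from above:

```scala
// Load the ORC data back into a DataFrame
val orcDf = hiveContext.read.format("orc").load("data_orc")

// Inspect the result
orcDf.printSchema()
orcDf.show()
```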

