Read ORC files directly from Spark shell

Views: 3778 Published: 2018/5/31 18:56:57 Tags: scala, hadoop, apache-spark, hive, pyspark


Question

I am having issues reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2 and using the pyspark shell, but I can also use spark-shell (which runs Scala).

I have used this resource: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html

from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

inputRead = sc.hadoopFile("hdfs://user@server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])

The error I get generally says the syntax is wrong. Once, the code seemed to work when I passed only the first of the three arguments to hadoopFile, but when I tried to use

inputRead.first()

the output was RDD[nothing, nothing]. I don't know whether this is because the inputRead variable was not created as an RDD or because it was not created at all.

Any help is appreciated!

Recommended answer

In Spark 1.5, I'm able to load my ORC file as:

val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
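
Since the question was asked from the pyspark shell, here is a rough Python counterpart of the Scala snippet above. It is only a sketch under some assumptions: `load_orc` is a hypothetical helper name (not part of Spark), the context passed in is assumed to come from Spark 1.4 or later, where `read` returns a DataFrameReader, and on Spark 1.x it is assumed to be a HiveContext, since ORC support there lives in the Hive module.

```python
def load_orc(sql_context, path):
    """Read the ORC data at `path` into a DataFrame.

    `load_orc` is a hypothetical helper, not a Spark API. `sql_context` is
    assumed to be a HiveContext from Spark 1.4 or later, where the `read`
    attribute returns a DataFrameReader.
    """
    # Same call chain as the Scala answer: select the "orc" data source,
    # then load the given path.
    return sql_context.read.format("orc").load(path)
```

In the pyspark shell this might be used as `df = load_orc(HiveContext(sc), "hdfs:///ORC_FILE_PATH")` followed by `df.show()`, with `HiveContext` imported from `pyspark.sql` as in the question. Note that `classOf[...]` in the question's code is Scala syntax, which is one likely source of the syntax error under pyspark.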
