Read ORC files directly from Spark shell
Problem description
I am having trouble reading an ORC file directly from the Spark shell. Note: I am running Hadoop 1.2 and Spark 1.2 and using the pyspark shell, but I can also use spark-shell (which runs Scala).
I have used this resource http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_orc-spark-quickstart.html .
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
inputRead = sc.hadoopFile("hdfs://user@server:/file_path",
classOf[inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat],
classOf[outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat])
I get an error that generally says the syntax is wrong. One time the code seemed to work, when I passed only the first of the three arguments to hadoopFile, but when I tried to use
inputRead.first()
the output was RDD[nothing, nothing]. I don't know whether this is because the inputRead variable was not created as an RDD or because it was not created at all.
Thanks for your help!
Recommended answer
In Spark 1.5, I'm able to load my ORC file as:
val orcfile = "hdfs:///ORC_FILE_PATH"
val df = sqlContext.read.format("orc").load(orcfile)
df.show
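Since the question uses the pyspark shell, the same DataFrameReader call can be sketched in Python. This is only a sketch under assumptions: it targets Spark 1.5 or later rather than the 1.2 setup in the question, the helper name load_orc is made up for illustration, and the sqlContext provided by the shell is assumed to be Hive-enabled, which Spark 1.x needs for ORC.

```python
# Sketch for the pyspark shell on Spark 1.5+, where `sqlContext` is predefined.
# Assumption: sqlContext is a HiveContext (Spark 1.x needs Hive support for ORC).
# `load_orc` is a hypothetical helper name, not part of the Spark API.

def load_orc(sql_context, path):
    """Load an ORC file into a DataFrame through the DataFrameReader API."""
    return sql_context.read.format("orc").load(path)

# In the shell, mirroring the Scala snippet above:
#   df = load_orc(sqlContext, "hdfs:///ORC_FILE_PATH")
#   df.show()
```

Unlike the hadoopFile attempt in the question, this path needs no input or output format classes; the "orc" format name selects the reader.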