Load Spark data locally: Incomplete HDFS URI
Question
I have run into a problem loading a local CSV file when running under SBT. Basically, I've written a Spark program in Scala Eclipse which reads the following file:
val searches = sc.textFile("hdfs:///data/searches")
This works fine on HDFS, but for debugging purposes I want to load this file from a local directory, which I have set up inside the project directory.
So I tried the following:
val searches = sc.textFile("file:///data/searches")
val searches = sc.textFile("./data/searches")
val searches = sc.textFile("/data/searches")
None of these lets me read the file locally, and all of them return this error under SBT:
Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:/data/pages
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
at org.apache.spark.rdd.RDD.count(RDD.scala:904)
at com.user.Result$.get(SparkData.scala:200)
at com.user.StreamingApp$.main(SprayHerokuExample.scala:35)
at com.user.StreamingApp.main(SprayHerokuExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
In the stack trace, com.user.Result$.get(SparkData.scala:200) is the line where sc.textFile is called. Spark seems to run against the Hadoop environment by default. Is there anything I can do to read this file locally?
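One way to see why scheme-less paths end up on HDFS is to inspect the Hadoop configuration Spark inherited. A minimal sketch, assuming an existing SparkContext named `sc` (the property name `fs.defaultFS` is the Hadoop 2.x name; older Hadoop versions use `fs.default.name`):

```scala
// Sketch: print which default filesystem Spark picked up from the
// Hadoop configuration (assumes an existing SparkContext `sc`).
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")
println(s"fs.defaultFS = $defaultFs")
// If this prints an hdfs:// URI, any path without an explicit scheme
// (e.g. "/data/searches") is resolved against HDFS, not the local disk.
```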
While running locally, I've configured a submit task in build.sbt:
submit <<= inputTask { (argTask: TaskKey[Seq[String]]) =>
  (argTask, mainClass in Compile, assemblyOutputPath in assembly, sparkHome) map {
    (args, main, jar, sparkHome) =>
      args match {
        case List(output) =>
          val sparkCmd = sparkHome + "/bin/spark-submit"
          Process(
            sparkCmd :: "--class" :: main.get :: "--master" :: "local[4]" ::
              jar.getPath :: "local[4]" :: output :: Nil) !
        case _ =>
          Process("echo" :: "Usage" :: Nil) !
      }
  }
}
The submit command is what I use to run the code.
Solution found: it turns out that file:///path/ is the correct way to do it, but in my case only the full path worked, i.e. home/projects/data/searches. Just putting data/searches did not (despite running from the home/projects directory).
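The working fix can be sketched as follows; the absolute path below is illustrative, assuming the project layout described above:

```scala
// Using an absolute local path with an explicit file:// scheme prevents
// Spark from resolving the path against the HDFS default filesystem.
// The concrete path is illustrative (project assumed under /home/projects).
val searches = sc.textFile("file:///home/projects/data/searches")
```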
Answer
This should work:
sc.textFile("file:///data/searches")
From your error it seems that Spark is loading a Hadoop configuration; this can occur when you have a Hadoop conf file on the classpath or a Hadoop environment variable set (like HADOOP_CONF_DIR).
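If you would rather override the inherited configuration than unset HADOOP_CONF_DIR, one option is to force the default filesystem back to the local disk in the driver. This is a sketch under the assumption of a Hadoop 2.x configuration (older versions use `fs.default.name` instead of `fs.defaultFS`):

```scala
// Sketch: force the default filesystem to the local disk so that
// scheme-less paths resolve locally. Assumes a SparkContext `sc`.
sc.hadoopConfiguration.set("fs.defaultFS", "file:///")
val searches = sc.textFile("/data/searches") // now read from the local disk
```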