Load a local file to Spark using sc.textFile()

Question

How do I load a file from the local file system into Spark using sc.textFile? Do I need to change any -env variables? Also, when I tried the same thing on my Windows machine, where Hadoop is not installed, I got the same error.

Code

> val inputFile = sc.textFile("file///C:/Users/swaapnika/Desktop/to do list")
/17 22:28:18 INFO MemoryStore: ensureFreeSpace(63280) called with curMem=0, maxMem=278019440
/17 22:28:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 61.8 KB, free 265.1 MB)
/17 22:28:18 INFO MemoryStore: ensureFreeSpace(19750) called with curMem=63280, maxMem=278019440
/17 22:28:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.3 KB, free 265.1 MB)
/17 22:28:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53659 (size: 19.3 KB, free: 265.1 MB)
/17 22:28:18 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
inputFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

> val words = input.flatMap(line => line.split(" "))
<console>:19: error: not found: value input
  val words = input.flatMap(line => line.split(" "))
              ^

> val words = inputFile.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23

> val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}

Error

apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/c:/spark-1.4.1-bin-hadoop2.6/bin/file/C:/Users/swaapnika/Desktop/to do list
   at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
   at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:290)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
   at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:289)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
   at $iwC$$iwC$$iwC.<init>(<console>:38)
   at $iwC$$iwC.<init>(<console>:40)
   at $iwC.<init>(<console>:42)
   at <init>(<console>:44)
   at .<init>(<console>:48)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:497)
   at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
   at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
   at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
   at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
   at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
   at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
   at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
   at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
   at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:497)
   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



Solution

The file path you have defined is incorrect. Because "file///" is missing the colon, it is not recognized as a URI scheme, so Hadoop treats the whole argument as a relative path and resolves it against the current working directory (the Spark bin folder). That is why the error message shows the two paths concatenated: file:/c:/spark-1.4.1-bin-hadoop2.6/bin/file/C:/Users/swaapnika/Desktop/to do list.

Try changing

sc.textFile("file///C:/Users/swaapnika/Desktop/to do list")

to

sc.textFile("file://C:/Users/swaapnika/Desktop/to do list")

or

sc.textFile("C:/Users/swaapnika/Desktop/to do list") 
