How to read a file from HDFS in map() quickly with Spark


Problem Description

I need to read a different file in every map(); the file is in HDFS:

  import java.io.{BufferedReader, InputStreamReader}
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val rdd = sc.parallelize(1 to 10000)
  val rdd2 = rdd.map { x =>
    // A new FileSystem handle is created on every map() call
    val hdfs = FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new Configuration())
    val path = new Path("/user/zhc/" + x + "/")
    val t = hdfs.listStatus(path)
    val in = hdfs.open(t(0).getPath)
    val reader = new BufferedReader(new InputStreamReader(in))
    var l = reader.readLine()
  }
  rdd2.count

My problem is that this line

val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new org.apache.hadoop.conf.Configuration())

takes too much running time: every call of map() needs to create a new FileSystem value. Can I put this code outside the map() function so it doesn't have to create hdfs every time? Or how can I read files quickly in map()?

My code runs on multiple machines. Thank you!
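One common way to address exactly this concern (a sketch, not part of the original question, assuming the same HDFS URI and directory layout) is to use mapPartitions instead of map, so the FileSystem handle is created once per partition rather than once per element:

```scala
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val rdd = sc.parallelize(1 to 10000)
val rdd2 = rdd.mapPartitions { iter =>
  // Constructed once per partition, on the executor, so no serialization
  // of the (non-serializable) FileSystem object is needed
  val hdfs = FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new Configuration())
  iter.map { x =>
    val t = hdfs.listStatus(new Path("/user/zhc/" + x + "/"))
    val reader = new BufferedReader(new InputStreamReader(hdfs.open(t(0).getPath)))
    try reader.readLine() finally reader.close()
  }
}
rdd2.count
```

This keeps the per-element work down to listing and opening the file itself, which is usually a large speedup when the FileSystem construction dominates.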

Recommended Answer

In your case, I recommend using the wholeTextFiles method, which returns a pair RDD where the key is the full file path and the value is the content of the file as a string.

val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/")
// Returns (fileName, number of lines) pairs; you could apply any other function to the file contents
val filesLineCount = filesPairRDD.map { case (name, content) => (name, content.split("\n").length) }
filesLineCount.collect()

Edit

If your files are in directories that all sit under the same parent directory (as mentioned in the comments), you can use a wildcard pattern:

val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/*/")

Hope this is clear and helpful.
