Read a text file from HDFS line by line in mapper


Question


Is the following Mapper code, which reads a text file from HDFS, correct? And if it is:

  1. What happens if two mappers in different nodes try to open the file at almost the same time?
  2. Isn't there a need to close the InputStreamReader? If so, how can I do it without closing the FileSystem?

My code is:

Path pt=new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(pt)));
String line;
line=br.readLine();
while (line != null){
System.out.println(line);

Solution

This will work with some amendments; I assume the code you've pasted is just truncated:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path pt = new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
try {
  String line = br.readLine();
  while (line != null) {
    System.out.println(line);

    // be sure to read the next line, otherwise you'll get an infinite loop
    line = br.readLine();
  }
} finally {
  // you should close out the BufferedReader
  br.close();
}
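On the second question: closing the BufferedReader also closes the wrapped InputStreamReader and the stream returned by fs.open(pt), but it does not close the shared FileSystem instance itself. On Java 7 and later you can let try-with-resources handle the closing; a minimal sketch, assuming the same hypothetical path from the question and that this runs inside a Mapper where context is available:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path pt = new Path("hdfs://pathTofile"); // hypothetical path from the question
FileSystem fs = FileSystem.get(context.getConfiguration());
// try-with-resources closes the reader (and the underlying input stream)
// automatically, even if an exception is thrown while reading
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)))) {
  String line;
  while ((line = br.readLine()) != null) {
    System.out.println(line);
  }
}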

You can have more than one mapper reading the same file, but there is a limit beyond which it makes more sense to use the Distributed Cache: not only does it reduce the load on the data nodes hosting the file's blocks, it is also more efficient when your job has a larger number of tasks than task nodes.
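For illustration, here is a minimal sketch of the Distributed Cache approach using the Hadoop 2.x mapreduce API. The class names, job name, and file path are hypothetical; the idea is that the file is registered at job-submission time and each task node then receives one local copy, which the mapper reads in setup() instead of opening the file from HDFS in every task:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileExample {

  public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // Each task node gets one local copy of the cached file, so tasks read
      // from local disk instead of loading the data nodes that hold the blocks.
      URI[] cacheFiles = context.getCacheFiles();
      if (cacheFiles != null && cacheFiles.length > 0) {
        // By default the localized copy is symlinked into the task's working
        // directory under the file's base name.
        String localName = new Path(cacheFiles[0].getPath()).getName();
        try (BufferedReader br = new BufferedReader(new FileReader(localName))) {
          String line;
          while ((line = br.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cache-file-example"); // hypothetical job name
    job.setJarByClass(CacheFileExample.class);
    job.setMapperClass(MyMapper.class);
    // Register the HDFS file with the distributed cache (path is hypothetical).
    job.addCacheFile(new URI("hdfs://namenode/path/to/file.txt"));
    // ... set input/output formats and paths as usual, then submit the job.
  }
}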
