How to read Nutch content from Java/Scala?

Views: 138 | Published: 2018/6/1 12:35:33 | Tags: java hadoop nutch


Problem description

I'm using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to use a Java (Scala) program to analyse the HTML data of those websites using Jsoup.

I got Nutch to work by following the tutorial (without the script, only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.

The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.

I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):

  val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
  var key = null
  var value = null
  reader.next(key, value) // test for a single value
  println(key)
  println(value)

However, I am getting this exception when I run it:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
    at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)

I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?

Solution

Scala:

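import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.nutch.protocol.Content
import org.apache.nutch.util.NutchConfiguration

val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)

// Lazily read (URL, Content) records. reader.next fills the objects passed in
// and returns false at the end of the file, which is where the stream stops.
val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  (reader.next(key, content), key, content)
}.takeWhile(_._1).map { case (_, key, content) => (key, content) }

// Print the first record, if there is one
webdata.headOption.foreach(println)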

Java:

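import java.io.IOException;

import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        // Strip generic Hadoop options; the first remaining argument is the segment directory
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
            // next() fills these objects, so they must be instantiated, not null
            Text key = new Text();
            Content content = new Content();
            // Loop through the records in the sequence file
            while (reader.next(key, content)) {
                byte[] raw = content.getContent();
                System.out.write(raw, 0, raw.length);
            }
            System.out.flush();
        } finally {
            reader.close();
        }
    }
}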
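
The Java loop above writes the raw bytes of each page to standard output. To hand a page to Jsoup, the bytes first have to become a String. The following is only a sketch of that step, not part of the original answer: the class name HtmlDecoder and its methods are hypothetical, it assumes a Content-Type header value is available for the record (where that value comes from is left open here), and it falls back to UTF-8 when no usable charset is found.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Hadoop.
public class HtmlDecoder {
    // Matches e.g. "charset=ISO-8859-1" or charset="utf-8" inside a header value
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Picks the charset named in a Content-Type header value, or UTF-8 as a fallback. */
    public static Charset charsetOf(String contentTypeHeader) {
        if (contentTypeHeader != null) {
            Matcher m = CHARSET.matcher(contentTypeHeader);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback below
                }
            }
        }
        return StandardCharsets.UTF_8; // assumption: UTF-8 is a sensible default
    }

    /** Decodes raw page bytes (e.g. the result of content.getContent()) to a String. */
    public static String decode(byte[] raw, String contentTypeHeader) {
        return new String(raw, charsetOf(contentTypeHeader));
    }
}
```

Inside the loop, something like HtmlDecoder.decode(content.getContent(), headerValue) would yield the HTML text, which could then be passed to Jsoup.parse(html, key.toString()) so that the page URL serves as the base URI.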

Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).
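
Both snippets above hard-code part-00000, and both open the data file inside that directory rather than the directory itself. A crawl may produce more than one part-* directory, so it can help to list them first. The sketch below is an assumption-laden illustration: SegmentDataFiles is a hypothetical name, and it assumes the segment lives on the local file system (Nutch in local mode) with the content/part-*/data layout described in the question.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; uses java.nio.file.Path, not org.apache.hadoop.fs.Path.
public class SegmentDataFiles {
    /** Returns the data file of every part-* directory under <segment>/content, sorted by path. */
    public static List<Path> find(Path segment) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> parts =
                 Files.newDirectoryStream(segment.resolve("content"), "part-*")) {
            for (Path part : parts) {
                Path data = part.resolve("data");
                // Skip part directories that have no data file
                if (Files.isRegularFile(data)) {
                    result.add(data);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the segment directory, e.g. crawl/segments/<time>
        for (Path p : find(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Each returned path could then be converted with toString() and wrapped in an org.apache.hadoop.fs.Path before being opened with SequenceFile.Reader as in the answer.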
