How to read Nutch content from Java/Scala?
Problem description
I'm using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to use a Java (Scala) program to analyse the HTML data of those websites with Jsoup.
I got Nutch to work by following the tutorial (the script did not work for me; only executing the individual instructions did), and I think it is saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but found it a bit overwhelming since I have never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
println(key)
println(value)
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
    at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
I am not sure how to work with a MapFile.Reader; specifically, which constructor parameters am I supposed to pass to it? Which Configuration objects should I pass in? Is that the correct FileSystem? And is that the data file I'm interested in?
Recommended answer
Scala:
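import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.nutch.protocol.Content
import org.apache.nutch.util.NutchConfiguration

val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)
// Lazily read (URL, Content) pairs. reader.next fills in the objects it is
// given and returns false at the end of the file, which ends the stream.
val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  val hasNext = reader.next(key, content)
  (hasNext, key, content)
}.takeWhile(_._1).map { case (_, key, content) => (key, content) }
println(webdata.head)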
Java:
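import java.io.IOException;

import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        // The first remaining argument is the path to a segment directory
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        Text key = new Text();
        Content content = new Content();
        // Loop through the records of the sequence file
        while (reader.next(key, content)) {
            try {
                // Write the raw bytes of the fetched page to standard output
                System.out.write(content.getContent(), 0,
                        content.getContent().length);
            } catch (Exception e) {
                // Skip records whose content cannot be written
            }
        }
        reader.close();
    }
}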
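Both readers above allocate the key and value objects before calling reader.next, because next does not return a record: it fills in the objects it is handed. Passing null, as in the question, is what leads to the NullPointerException. A minimal sketch of that fill-in-place pattern in plain Java, without Hadoop (Holder, PairReader and sample are hypothetical stand-ins, not Hadoop or Nutch classes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class FillInPlaceDemo {
    // Hypothetical stand-in for a Writable such as Text: its state is
    // overwritten from a stream each time readFields() is called.
    public static class Holder {
        public String value = "";

        void readFields(DataInputStream in) throws IOException {
            value = in.readUTF();
        }
    }

    // Hypothetical stand-in for a sequence-file reader: next() fills the two
    // objects supplied by the caller and reports whether a record was read.
    public static class PairReader {
        private final DataInputStream in;

        public PairReader(byte[] data) {
            in = new DataInputStream(new ByteArrayInputStream(data));
        }

        public boolean next(Holder key, Holder value) {
            try {
                if (in.available() == 0) {
                    return false; // end of data
                }
                key.readFields(in);   // NullPointerException if key is null
                value.readFields(in);
                return true;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Builds one serialized (URL, HTML) record to read back.
    public static byte[] sample() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF("http://example.com/");
            out.writeUTF("<html>hello</html>");
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairReader reader = new PairReader(sample());
        Holder key = new Holder();     // allocated once, refilled per record
        Holder value = new Holder();
        while (reader.next(key, value)) {
            System.out.println(key.value + " -> " + value.value);
        }
    }
}
```

Reusing one key/value pair across iterations, as the Java reader does, avoids allocations; creating fresh objects per record, as the Scala version does, is needed when the records are kept around after the next call.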
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).
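Since the goal in the question is to analyse the pages with Jsoup, the raw bytes returned by content.getContent() would first have to be turned into a string. A rough sketch of that step using only the JDK, assuming a Content-Type header value is available to the caller (HtmlDecoder, charsetOf and decodeHtml are hypothetical names; how the header is obtained from Nutch is not shown here, and UTF-8 is an assumed default):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class HtmlDecoder {
    // Picks a charset from a header value such as
    // "text/html; charset=ISO-8859-1"; falls back to UTF-8 (an assumption)
    // when the header is missing or names an unknown charset.
    public static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                            .replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed name: use the default below
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw page bytes into text using the charset chosen above.
    public static String decodeHtml(byte[] raw, String contentType) {
        return new String(raw, charsetOf(contentType));
    }

    public static void main(String[] args) {
        byte[] raw = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeHtml(raw, "text/html; charset=ISO-8859-1"));
    }
}
```

The resulting string could then be handed to Jsoup.parse together with the URL held in the record's key.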