如何从 nutch 获取 html 内容 [英] How to get the html content from nutch

查看:43
本文介绍了如何从 nutch 获取 html 内容的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有没有什么办法可以在抓取网页的同时获取每个网页的html内容?

Is there is any way to get the html content of each webpage in nutch while crawling the web page?

推荐答案

是的,您可以直接导出抓取到的片段的内容.这并不简单,但对我来说效果很好.首先,使用以下代码创建一个java项目:

Yes, you can acutally export the content of the crawled segments. It is not straightforward, but it works well for me. First, create a java project with the following code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

import java.io.File;
import java.io.FileOutputStream;

public class NutchSegmentOutputParser {

public static void main(String[] args) {

    if (args.length != 2) {
        System.out.println("usage: segmentdir (-local | -dfs <namenode:port>) outputdir");
        return;
    }

    try {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);


        String segment = args[0];

        File outDir = new File(args[1]);
        if (!outDir.exists()) {
            if (outDir.mkdir()) {
                System.out.println("Creating output dir " + outDir.getAbsolutePath());
            }
        }

        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);


        Text key = new Text();
        Content content = new Content();

        while (reader.next(key, content)) {
            String filename = key.toString().replaceFirst("http://", "").replaceAll("/", "___").trim();

            File f = new File(outDir.getCanonicalPath() + "/" + filename);
            FileOutputStream fos = new FileOutputStream(f);
            fos.write(content.getContent());
            fos.close();
            System.out.println(f.getAbsolutePath());
        }
        reader.close();
        fs.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

}

}

我推荐使用Maven;添加以下依赖项:

I recommend using Maven; add the following dependencies:

     <dependency>
      <groupId>org.apache.nutch</groupId>
        <artifactId>nutch</artifactId>
        <version>1.5.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>0.23.1</version>
    </dependency>

并创建一个 jar 包(即 NutchSegmentOutputParser.jar)

and create a jar package (i.e. NutchSegmentOutputParser.jar)

您需要在您的机器上安装 Hadoop.然后运行:

You need Hadoop to be installed on your machine. Then run:

$/hadoop-dir/bin/hadoop --config \
NutchSegmentOutputParser.jar:~/.m2/repository/org/apache/nutch/nutch/1.5.1/nutch-1.5.1.jar \
NutchSegmentOutputParser nutch-crawled-dir/2012xxxxxxxxx/ outdir

其中 nutch-crawled-dir/2012xxxxxxxxx/是您要从中提取内容的爬网目录(它包含segment"子目录),而 outdir 是输出目录.输出文件名是从 URI 生成的,但是,斜线被替换为_".

where nutch-crawled-dir/2012xxxxxxxxx/ is the crawled directory you want to extract content from (it contains 'segment' subdirectory) and outdir is an output dir. The output file names are generated from URI, however, the slashes are replaced by "_".

希望有帮助.

这篇关于如何从 nutch 获取 html 内容的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆