Apache Nutch 2.1 - How to get the complete source code


Problem description

I am trying to write my own Nutch plugin for crawling webpages. The problem is that I need to detect whether some special tag is present on the webpage. The official documentation has a note saying this is possible using Document.getElementsByTagName("foo"), but that is not working for me. Do you have any idea?

My second question: once I have identified the tag above, I would like to get some other tags from the same webpage. Is there any way to store the complete source code of the webpage at the moment it is crawled?

Thanks, Jan

Recommended answer

If you want to extract content based on an HTML tag, you could look at the xpath-filter plugin: http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
You can write an XPath query and configure it in the plugin to extract the information you need.
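
To give a feel for what such an XPath query does, here is a minimal sketch that uses only the JDK's javax.xml.xpath package. It does not show the plugin or its configuration; the class name XPathSketch, the helper extract, and the sample page are made up for illustration. It also assumes the input is well-formed XHTML, which real crawled HTML often is not, so a tolerant HTML parser would normally sit in front of this step.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Nutch or of the xpath-filter plugin.
class XPathSketch {

    // Evaluates an XPath expression against well-formed XML/XHTML
    // and returns the result converted to a string.
    static String extract(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page; "foo" stands in for the special tag from the question.
        String page = "<html><body><div id=\"content\"><p>Hello</p></div>"
                + "<foo>bar</foo></body></html>";

        // Text inside the div with id "content".
        System.out.println(extract(page, "//div[@id='content']"));

        // Whether at least one <foo> element exists anywhere in the page.
        System.out.println(extract(page, "boolean(//foo)"));
    }
}
```

The boolean(...) query covers the "is this tag present" check from the question, while a path such as //div[@id='content'] pulls out the text of one specific element.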

Another option is to write a plugin (as you are doing at the moment) and use an HTML/XML parser to get the information out. Here is what I did when I needed to extract some content from a specific div:

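  // Imports this snippet needs (Nutch indexing API and Jsoup).
  // Note: this filter signature matches the IndexingFilter interface of Nutch 1.x.
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.metadata.Metadata;
  import org.apache.nutch.parse.Parse;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // "fullcontent" is expected to hold the page's HTML, stored in the
        // parse metadata earlier in the pipeline (that step is not shown here).
        Metadata metadata = parse.getData().getParseMeta();
        String fullContent = metadata.get("fullcontent");
        if (fullContent == null) {
            return doc; // nothing stored for this page, leave the document unchanged
        }

        Document document = Jsoup.parse(fullContent);
        Element contentwrapper = document.select("div#content").first();

        // first() returns null when no element matches the selector
        if (contentwrapper != null) {
            // Add the div's text as a new index field
            doc.add("contentwrapper", contentwrapper.text());
        }

        return doc;
  }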
