How to extend Nutch for article crawling

Views: 52 | Published: 2021/6/11 18:42:33 | Tags: web-crawler, nutch


Problem description

I'm looking for a framework to grab articles, and I found Nutch 2.1. Here is my plan, with the questions I have at each step:

Add article list pages into url/seed.txt. Here is one problem: what I actually want indexed is the article pages, not the article list pages. But if I don't allow the list pages to be indexed, Nutch does nothing, because the list pages are the entry points. So how can I index only the article pages and not the list pages?
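To make this first question concrete, here is a minimal sketch of the rule being asked for: list pages are still fetched so their links can be followed, but only article pages are marked for indexing. The class name `ArticleUrlClassifier`, its methods, and both URL patterns are hypothetical and only illustrate the idea; a real site needs its own patterns. It is plain Java with no Nutch dependency. As far as I understand, a Nutch indexing filter can discard a document by returning null, so a check like `shouldIndex` could be called from such a plugin, but that wiring is an assumption to verify against the Nutch 2.1 source.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: separates pages to crawl through from pages to index. */
public class ArticleUrlClassifier {

    // Assumed URL shape of a list page, e.g. http://example.com/news/ or /news/page/2
    private static final Pattern LIST_PAGE =
            Pattern.compile("^https?://[^/]+/news/(index\\.html|page/\\d+)?$");

    // Assumed URL shape of an article page, e.g. /news/2021/06/some-title.html
    private static final Pattern ARTICLE_PAGE =
            Pattern.compile("^https?://[^/]+/news/\\d{4}/\\d{2}/[\\w-]+\\.html$");

    /** True for entry pages that should be fetched for their outlinks only. */
    public static boolean isListPage(String url) {
        return LIST_PAGE.matcher(url).matches();
    }

    /** True only for article pages; everything else is skipped at index time. */
    public static boolean shouldIndex(String url) {
        return ARTICLE_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/news/",
            "http://example.com/news/page/2",
            "http://example.com/news/2021/06/nutch-intro.html"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (shouldIndex(url) ? "index" : "skip"));
        }
    }
}
```

The point of the split is that the decision to fetch a page and the decision to index it are separate, so the seed list can stay as it is.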

Write a plugin to parse the 'author', 'date', 'article body', 'headline' and possibly other information out of the HTML. The 'Parser' plugin interface in Nutch 2.1 is: Parse getParse(String url, WebPage page). The 'WebPage' class has some predefined attributes:

public class WebPage extends PersistentBase {
  // ...
  private Utf8 baseUrl;
  // ...
  private ByteBuffer content; // <== This becomes null in IndexFilter
  // ...
  private Utf8 title;
  private Utf8 text;
  // ...
  private Map<Utf8,Utf8> headers;
  private Map<Utf8,Utf8> outlinks;
  private Map<Utf8,Utf8> inlinks;
  private Map<Utf8,Utf8> markers;
  private Map<Utf8,ByteBuffer> metadata;
  // ...
}

So, as you can see, there are 5 maps I could put my custom attributes in. But 'headers', 'outlinks' and 'inlinks' do not seem to be meant for this. Maybe I could put that information into 'markers' or 'metadata'. Are they designed for this purpose?
BTW, the Parser in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.
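As an illustration of what such a plugin would have to produce, the following sketch pulls 'author', 'date' and 'headline' out of an HTML string and stores them as UTF-8 bytes in a map shaped like the 'metadata' field above. Everything here is an assumption made for the example: the class `ArticleFieldExtractor`, the `<meta>` and `<h1>` markup it expects, and the use of `String` keys in place of the `Utf8` keys that `WebPage` uses. Regular expressions stand in for a real HTML parser only to keep the example dependency-free. Whether 'metadata' is the intended place for such fields is exactly the open question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor: HTML in, metadata-style map out. */
public class ArticleFieldExtractor {

    // Assumed markup: <meta name="author" content="..."> and <meta name="date" content="...">
    private static final Pattern AUTHOR = Pattern.compile(
            "<meta\\s+name=\"author\"\\s+content=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern DATE = Pattern.compile(
            "<meta\\s+name=\"date\"\\s+content=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    // Assumed markup: the headline is the first <h1> element
    private static final Pattern HEADLINE = Pattern.compile(
            "<h1[^>]*>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns only the fields that were found; missing fields are left out. */
    public static Map<String, ByteBuffer> extract(String html) {
        Map<String, ByteBuffer> metadata = new LinkedHashMap<>();
        putIfFound(metadata, "author", AUTHOR, html);
        putIfFound(metadata, "date", DATE, html);
        putIfFound(metadata, "headline", HEADLINE, html);
        return metadata;
    }

    private static void putIfFound(Map<String, ByteBuffer> metadata, String key,
                                   Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            byte[] bytes = m.group(1).trim().getBytes(StandardCharsets.UTF_8);
            metadata.put(key, ByteBuffer.wrap(bytes));
        }
    }

    /** Decodes a stored value without moving the buffer's read position. */
    public static String decode(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"author\" content=\"Jane Doe\">"
                + "<meta name=\"date\" content=\"2021-06-11\"></head>"
                + "<body><h1>Extending Nutch</h1><p>Body text.</p></body></html>";
        for (Map.Entry<String, ByteBuffer> e : extract(html).entrySet()) {
            System.out.println(e.getKey() + " = " + decode(e.getValue()));
        }
    }
}
```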

After the articles are indexed into Solr, another application can query them by 'date' and then store the article information in MySQL. My question here is: can Nutch store the articles directly in MySQL? Or can I write a plugin to customize the indexing behavior?

Is Nutch a good choice for my purpose? If not, can you suggest another good-quality framework or library? Thanks for your help.
