How to extend Nutch for article crawling

Question

I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's my plan, with the question I have at each step:

1

Add the article list pages to url/seed.txt. Here's one problem: what I actually want indexed is the article pages, not the article list pages. But if I don't allow the list pages to be indexed, Nutch will do nothing, because the list pages are the entry points. So how can I index only the article pages and leave out the list pages?

2

Write a plugin to parse the 'author', 'date', 'article body', 'headline', and maybe other information out of the HTML. The 'Parser' plugin interface in Nutch 2.1 is: Parse getParse(String url, WebPage page). The 'WebPage' class has some predefined attributes:

public class WebPage extends PersistentBase {
  // ...
  private Utf8 baseUrl;
  // ...
  private ByteBuffer content; // <== This becomes null in IndexFilter
  // ...
  private Utf8 title;
  private Utf8 text;
  // ...
  private Map<Utf8,Utf8> headers;
  private Map<Utf8,Utf8> outlinks;
  private Map<Utf8,Utf8> inlinks;
  private Map<Utf8,Utf8> markers;
  private Map<Utf8,ByteBuffer> metadata;
  // ...
}

So, as you can see, there are 5 maps I could put my own attributes in. But 'headers', 'outlinks', and 'inlinks' do not seem to be meant for this. Maybe I could put that information into 'markers' or 'metadata'. Are they designed for this purpose?
BTW, the Parser in trunk looks like 'public ParseResult getParse(Content content)', which seems more reasonable to me.
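
To make the 'metadata' idea concrete, here is a minimal sketch of packing extracted fields into a map of ByteBuffer values, the value type the 'metadata' map uses. It is not Nutch code: the ArticleMetadata class, the key names, and the use of String keys (standing in for Utf8) are all assumptions for illustration, and it does not settle whether 'metadata' is the intended place for such values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch. The Map<String, ByteBuffer> stands in
// for WebPage's metadata map, which is keyed by Utf8 in the real class.
final class ArticleMetadata {
    private ArticleMetadata() {}

    // Store a text value as UTF-8 bytes under the given key.
    static void put(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Read a value back as text, or null if the key is absent.
    static String get(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key);
        if (buf == null) {
            return null;
        }
        // duplicate() so decoding does not move the stored buffer's position.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>();
        put(metadata, "article.author", "Jane Doe");   // assumed key name
        put(metadata, "article.date", "2013-01-15");   // assumed key name
        System.out.println(get(metadata, "article.author"));
        System.out.println(get(metadata, "article.date"));
    }
}
```

Decoding from a duplicate of the buffer keeps repeated reads safe, which matters if a later step (such as an index filter) reads the same value again.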

3

After the articles are indexed into Solr, another application can query them by 'date' and then store the article information in MySQL. My question here is: can Nutch store the articles directly in MySQL? Or can I write a plugin to control the indexing behavior?

Is Nutch a good choice for my purpose? If not, can you suggest another good-quality framework or library? Thanks for your help.

Recommended answer

If article extraction from a few websites is all that you are looking for, then check out http://www.crawl-anywhere.com/

It comes with an admin UI where you can specify that you want to use the boilerpipe article extractor (which is great). You can also specify, by URL pattern matching, which pages you want crawled only and which pages you want crawled AND indexed.
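
The crawl-versus-index split can be sketched independently of any crawler. The example below is a hypothetical illustration, not the configuration format of Crawl Anywhere or Nutch: the UrlRoles class and the example.com URL layouts are assumptions. List pages are followed for their links but never indexed; article pages are both followed and indexed.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: decide per URL whether a page is only crawled
// (followed for links) or also indexed. The URL layouts are assumed.
final class UrlRoles {
    private UrlRoles() {}

    // Assumed list-page layout: http://example.com/articles/ or .../articles/?page=N
    private static final Pattern LIST_PAGE =
            Pattern.compile("^https?://example\\.com/articles/(\\?page=\\d+)?$");

    // Assumed article layout: http://example.com/articles/<id>/<slug>
    private static final Pattern ARTICLE_PAGE =
            Pattern.compile("^https?://example\\.com/articles/\\d+/[a-z0-9-]+$");

    // Both list pages and article pages are fetched and parsed for outlinks.
    static boolean shouldCrawl(String url) {
        return LIST_PAGE.matcher(url).matches() || ARTICLE_PAGE.matcher(url).matches();
    }

    // Only article pages are sent on to the index.
    static boolean shouldIndex(String url) {
        return ARTICLE_PAGE.matcher(url).matches();
    }
}
```

With these rules, a list page such as http://example.com/articles/?page=2 still serves as an entry point, which addresses the concern in step 1, while only pages matching the article pattern reach the index.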
