How to conduct a web crawl for a specific topic via Apache Nutch?


Question

I'm new to this field, and as students we have to create a web portal for a specific topic. As a first step, we have to crawl the web (or part of it) to gather links on this topic, which we will then index and rank, with the final purpose of feeding them into our portal as its database.

The thing is that I cannot come up with the right methodology. Let's say the theme of our portal is "health insurance".

  1. What steps do I have to follow as a methodology, and which tools do I need?
  2. Is there a way to guide Nutch toward specific content?
  3. Should I fill my seeds.txt with a wide range of links, parse a lot of links, and then filter the content?

You can describe the steps at a high level, and I'll do the research on how to implement them.

Recommended answer

Nutch comes with a built-in NaiveBayesParseFilter. You have to add the following properties to nutch-site.xml and also create a training file as described below. In my experience, it performs well even with only a handful of training documents. Of course, the more the merrier.

<!-- plugin.includes is a regular expression of enabled plugin IDs: add parsefilter-naivebayes to your existing value rather than replacing it. -->
<property>
  <name>plugin.includes</name>
  <value>parsefilter-naivebayes</value>
</property>
<property>
  <name>parsefilter.naivebayes.trainfile</name>
  <value></value>
  <description>Set the name of the file to be used for Naive Bayes training. The format is:
Each line contains two tab-separated parts:
1. "1" or "0": "1" for a relevant and "0" for an irrelevant document.
2. Text (the text that will be used for training).

Each row will be considered a new "document" for the classifier.
CAUTION: Set parser.timeout to -1 or to a value larger than 30 when using this classifier.
  </description>
</property>
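
The training file format described above (one document per line, a "1" or "0" label, a tab, then the text) can be produced with a short script. The following is only a minimal sketch in Python: the file name naivebayes-train.txt, the helper names, and the example snippets are all hypothetical and not part of Nutch. It collapses whitespace inside each text so that every document stays on a single line with exactly one tab.

```python
from pathlib import Path

# Hypothetical labeled snippets for a "health insurance" portal.
# Replace them with text taken from real relevant/irrelevant pages.
EXAMPLES = [
    (1, "Compare health insurance plans, premiums and deductibles."),
    (1, "How to file a claim with your medical insurance provider."),
    (0, "Top ten football transfers of the summer season."),
    (0, "Recipe: slow-cooked\tbeef stew with\nroot vegetables."),
]


def to_training_line(label, text):
    """Return one 'label<TAB>text' row for the training file."""
    if label not in (0, 1):
        raise ValueError("label must be 0 (irrelevant) or 1 (relevant)")
    # Collapse tabs, newlines and repeated spaces so the row keeps
    # exactly one tab and stays on one line.
    cleaned = " ".join(text.split())
    if not cleaned:
        raise ValueError("text must not be empty")
    return f"{label}\t{cleaned}"


def write_training_file(path, examples):
    """Write all examples to path, one document per line; return the rows."""
    rows = [to_training_line(label, text) for label, text in examples]
    Path(path).write_text("\n".join(rows) + "\n", encoding="utf-8")
    return rows


if __name__ == "__main__":
    # Hypothetical file name; use whatever name you set in
    # parsefilter.naivebayes.trainfile.
    write_training_file("naivebayes-train.txt", EXAMPLES)
```

The name of the resulting file would then go into the parsefilter.naivebayes.trainfile value, and the file itself needs to be somewhere Nutch can load it from (check your Nutch version's documentation for the expected location).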
