Find all the web pages in a domain and its subdomains

Problem description

I am looking for a way to find all the web pages in a domain and its subdomains. For example, in the uoregon.edu domain, I would like to find all the web pages in that domain and in all of its subdomains (e.g., cs.uoregon.edu).

I have been looking at Nutch, and I think it can do the job. However, it seems that Nutch downloads entire web pages and indexes them for later search, whereas I want a crawler that only scans each page for URLs belonging to the same domain. Furthermore, Nutch appears to save its linkdb in a serialized format; how can I read it? I tried Solr, which can read Nutch's collected data, but I don't think I need Solr since I am not performing any searches. All I need are the URLs that belong to a given domain.

Thanks.

Recommended answer

If you're familiar with Ruby, consider using Anemone, a wonderful crawling framework. Here is sample code that works out of the box:

require 'anemone'

urls = []

# site_url is the root URL to start crawling from, e.g. "http://www.uoregon.edu/"
Anemone.crawl(site_url) do |anemone|
  # Record the URL of every page visited during the crawl
  anemone.on_every_page do |page|
    urls << page.url
  end
end

https://github.com/chriskite/anemone

Disclaimer: you need to apply a patch from the project's issue tracker to crawl subdomains, and you might want to consider adding a maximum page count.
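Since only the URLs are needed (no indexing or search), the sample above can be extended a little. The sketch below is a possibility rather than a drop-in solution: it assumes the subdomain patch mentioned in the disclaimer is applied (stock Anemone only follows links on the starting host), and it uses Anemone's focus_crawl hook plus the documented :discard_page_bodies and :depth_limit options. site_url, domain, max_pages, and the output file name are illustrative values chosen for the uoregon.edu example, not anything built into Anemone.

require 'anemone'

site_url  = "http://www.uoregon.edu/"   # assumed starting page
domain    = "uoregon.edu"               # stay inside this domain and its subdomains
max_pages = 10_000                      # illustrative cap; not a built-in Anemone option

urls = []

# Page bodies are not needed since only URLs are collected,
# and a depth limit keeps the crawl bounded.
Anemone.crawl(site_url, :discard_page_bodies => true, :depth_limit => 10) do |anemone|
  # Only queue links whose host is the domain itself or one of its subdomains
  # (this relies on the subdomain patch so such links show up in page.links).
  anemone.focus_crawl do |page|
    next [] if urls.size >= max_pages   # stop expanding once the cap is reached
    page.links.select do |link|
      host = link.host.to_s
      host == domain || host.end_with?(".#{domain}")
    end
  end

  anemone.on_every_page do |page|
    urls << page.url.to_s
  end
end

# One URL per line; no Solr or Nutch needed for this.
File.write("uoregon_urls.txt", urls.uniq.sort.join("\n"))

The host check (an exact match or a suffix of ".uoregon.edu") is what keeps cs.uoregon.edu and the other subdomains in scope while excluding external sites.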
