How to write a crawler?


Problem description

I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content.

Does anybody have any thoughts on how to do this? Where do you point the crawler to get started? How does it send back its findings and still keep crawling? How does it know what it finds, etc,etc.

Answer

You'll be reinventing the wheel, to be sure. But here's the basics:

  • A list of unvisited URLs - seed it with one or more starting pages
  • A list of visited URLs - so you don't go around in circles
  • A set of rules for URLs you're not interested in - so you don't index the whole Internet
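As a minimal sketch, the three pieces of state above might look like this in Python. The seed URL and the allowed-prefix rule are hypothetical placeholders; substitute your NPO's actual sites:

```python
from collections import deque

# Hypothetical seed; replace with your NPO's starting pages.
SEED_URLS = ["https://www.example.org/"]

unvisited = deque(SEED_URLS)  # URLs not yet visited, seeded with starting pages
visited = set()               # URLs already visited, so we don't go in circles

# One simple rule: only follow URLs on our own site,
# so we don't end up indexing the whole Internet.
ALLOWED_PREFIX = "https://www.example.org/"

def matches_rules(url):
    """Return True for URLs the crawler should follow."""
    return url.startswith(ALLOWED_PREFIX)
```

A real rule set would usually also exclude binary files, query-string traps, and anything disallowed by robots.txt.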

Put these in persistent storage, so you can stop and start the crawler without losing state.
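One simple way to persist that state is a JSON file, written whenever you pause the crawler and reloaded on startup. This is a sketch under that assumption; a database would work just as well:

```python
import json

def save_state(path, unvisited, visited):
    """Write the two URL lists to a JSON file so the crawler can resume later."""
    with open(path, "w") as f:
        json.dump({"unvisited": list(unvisited), "visited": sorted(visited)}, f)

def load_state(path):
    """Load saved state; if none exists yet, start fresh from a seed URL."""
    try:
        with open(path) as f:
            state = json.load(f)
        return state["unvisited"], set(state["visited"])
    except FileNotFoundError:
        # Hypothetical seed; replace with your own starting page.
        return ["https://www.example.org/"], set()
```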

The algorithm is:

while(list of unvisited URLs is not empty) {
    take URL from list
    remove it from the unvisited list and add it to the visited list
    fetch content
    record whatever it is you want to about the content
    if content is HTML {
        parse out URLs from links
        foreach URL {
           if it matches your rules
              and it's not already in either the visited or unvisited list
              add it to the unvisited list
        }
    }
}
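The pseudocode above can be sketched as a runnable Python function using only the standard library. The `matches_rules` callback and the `max_pages` safety limit are assumptions, not part of the original answer:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from <a> tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, matches_rules, max_pages=100):
    unvisited = deque(seed_urls)
    visited = set()
    findings = []
    while unvisited and len(visited) < max_pages:
        # Take a URL from the unvisited list and move it to the visited list.
        url = unvisited.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as resp:
                content_type = resp.headers.get_content_type()
                body = resp.read()
        except OSError:
            continue  # skip pages that fail to fetch
        findings.append(url)  # record whatever you want about the content
        if content_type == "text/html":
            parser = LinkParser()
            parser.feed(body.decode("utf-8", errors="replace"))
            for link in parser.links:
                absolute = urljoin(url, link)  # resolve relative links
                if (matches_rules(absolute)
                        and absolute not in visited
                        and absolute not in unvisited):
                    unvisited.append(absolute)
    return findings
```

A production crawler would also want rate limiting, robots.txt handling, and URL normalization, but the loop structure is exactly the pseudocode above.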

