Scrapy get all links from any website


Problem description

I have the following code for a web crawler in Python 3:

import requests
from bs4 import BeautifulSoup
import re

def get_links(link):
    # Collect every absolute http(s) link found on the page
    return_links = []

    r = requests.get(link)

    soup = BeautifulSoup(r.content, "lxml")

    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))

    return return_links

def recursive_search(links):
    # Fetch the links found on each page, keep extending the list,
    # then recurse over the (growing) list again
    for i in links:
        links.extend(get_links(i))
    recursive_search(links)


recursive_search(get_links("https://www.brandonskerritt.github.io"))

The code basically gets all the links off of my GitHub Pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.

I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials / guides / Stack Overflow / Quora / blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will crawl all websites recursively.
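For reference, robots.txt handling in Scrapy is controlled by a single setting; a minimal settings.py sketch (the USER_AGENT value here is only an illustrative placeholder):

# settings.py sketch — Scrapy's RobotsTxtMiddleware honours robots.txt
# whenever this setting is enabled; projects generated by a recent
# `scrapy startproject` usually turn it on already.
ROBOTSTXT_OBEY = True

# Identifying the crawler is good practice; this value is just a placeholder.
USER_AGENT = 'demo-crawler (for demonstration purposes only)'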

This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, the links might be in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.

I do not want to use this in the wild; I need it for demonstration purposes, so I'm not going to suddenly annoy everyone with excessive web crawling.

Any help would be greatly appreciated!

Recommended answer

There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
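A sketch of the kind of settings that page discusses, with purely illustrative values that should be tuned to your hardware and politeness requirements:

# settings.py sketch for a broad crawl (illustrative values)
CONCURRENT_REQUESTS = 100           # raise global concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # but stay polite per domain
REACTOR_THREADPOOL_MAXSIZE = 20     # more threads for DNS resolution
LOG_LEVEL = 'INFO'                  # DEBUG logging is expensive at this scale
COOKIES_ENABLED = False             # broad crawls rarely need cookies
RETRY_ENABLED = False               # don't retry failed pages
DOWNLOAD_TIMEOUT = 15               # give up on slow sites quickly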

To recreate the behaviour you need in Scrapy, you must:

  • set your start URL for the page, and
  • write a parse function that follows all links and recursively calls itself, adding the requested URLs to a spider attribute.

An untested example (that can, of course, be refined):

import scrapy


class AllSpider(scrapy.Spider):
    name = 'all'

    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Every URL the spider has visited so far
        self.links = []

    def parse(self, response):
        self.links.append(response.url)
        # Follow every anchor href on the page and parse it recursively
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
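
Since this is only for a demonstration, one way to keep the crawl bounded and robots.txt-aware without editing the project settings is a per-spider custom_settings block; a sketch with arbitrary demo limits:

    # Added inside the AllSpider class above: per-spider settings that
    # override settings.py. CLOSESPIDER_PAGECOUNT is handled by the
    # CloseSpider extension (enabled by default) and stops the crawl
    # after that many responses; the values are arbitrary demo limits.
    custom_settings = {
        'ROBOTSTXT_OBEY': True,
        'CLOSESPIDER_PAGECOUNT': 100,
        'DOWNLOAD_DELAY': 1.0,
    }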

