Scrapy: follow all the links and get the status of each
Question
I want to follow all the links of the website and get the status of every link, e.g. 404 or 200. I tried this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class someSpider(CrawlSpider):
    name = 'linkscrawl'
    item = []
    allowed_domains = ['mysite.com']
    start_urls = ['http://mysite.com/']

    rules = (
        Rule(LinkExtractor(), callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = response.url
        print(item)
I can see the links on the console, but without status codes, like:
mysite.com/navbar.html
mysite.com/home
mysite.com/aboutus.html
mysite.com/services1.html
mysite.com/services3.html
mysite.com/services5.html
But how do I save all the links, with their statuses, to a text file?
Answer
I solved this as below. Hope this helps anyone who needs it.
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class LinkscrawlItem(scrapy.Item):
    # fields for the crawled data: the URL and its HTTP status
    link = scrapy.Field()
    attr = scrapy.Field()

class someSpider(CrawlSpider):
    name = 'linkscrawl'
    item = []
    allowed_domains = ['mysite.com']
    start_urls = ['http://www.mysite.com/']

    rules = (
        Rule(LinkExtractor(), callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = LinkscrawlItem()
        item["link"] = str(response.url) + ":" + str(response.status)
        # append each "url:status" line to a text file
        filename = 'links.txt'
        with open(filename, 'a') as f:
            f.write('\n' + str(response.url) + ":" + str(response.status) + '\n')
        self.log('Saved file %s' % filename)
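One caveat the answer above glosses over: by default, Scrapy's HttpError middleware filters out non-2xx responses before they reach the callback, so broken links returning 404 are silently dropped rather than recorded. The sketch below is one way to handle this, assuming a recent Scrapy release (where the imports live under scrapy.spiders and scrapy.linkextractors rather than scrapy.contrib); the spider name and the status list are illustrative, and mysite.com is the same placeholder domain as above.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LinkStatusSpider(CrawlSpider):
    name = 'linkstatus'
    allowed_domains = ['mysite.com']
    start_urls = ['http://www.mysite.com/']

    # Scrapy's HttpError middleware normally drops non-2xx responses
    # before the callback runs; listing the statuses we care about
    # lets the 404s (and 500s) through so they can be recorded.
    handle_httpstatus_list = [404, 500]

    rules = (
        Rule(LinkExtractor(), callback='parse_obj', follow=True),
    )

    def parse_obj(self, response):
        # Yield a plain dict; Scrapy's feed exporter serializes it.
        yield {'link': response.url, 'status': response.status}

Yielding items instead of opening the file by hand also means the output format becomes a command-line decision: for example, scrapy crawl linkstatus -o links.csv writes the same url/status pairs out as CSV.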