How to get the original start_url in Scrapy (before redirect)

Views: 47 | Published: 2021/7/5 19:39:34 | Tags: python, redirect, web-scraping, scrapy

Problem description

I'm using Scrapy to crawl some pages. I read the start_urls from an Excel sheet, and I need to save each URL in the item.

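import xlrd
from scrapy import Spider

# `path` and `abcspiderItem` are defined elsewhere in the project.


class abc_Spider(Spider):
    name = 'abc'
    allowed_domains = ['abc.com']
    # Read the start URLs from column index 15 of Sheet1.
    wb = xlrd.open_workbook(path + '/somefile.xlsx')
    sh = wb.sheet_by_name(u'Sheet1')
    start_urls = sh.col_values(15)
    handle_httpstatus_list = [404]

    def parse(self, response):
        item = abcspiderItem()
        # response.url is the final URL, i.e. the one reached after any redirect.
        item['url'] = response.url
        yield item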

The problem is that the URL gets redirected to a different URL, so response.url holds the redirect target rather than the address I requested. How do I get the original URL that I read from the Excel sheet?
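One possible approach, sketched here under the assumption that Scrapy's built-in RedirectMiddleware is enabled (it is by default): that middleware records every URL it redirects away from in the request's meta under the key redirect_urls, so the first entry is the address that was originally requested. The helper name original_url and the example URLs below are hypothetical, and the SimpleNamespace objects are only stand-ins for a real Scrapy response so the behaviour can be seen without running a crawl.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the URL that was requested before any redirects.

    Assumes RedirectMiddleware is enabled: it stores the chain of URLs it
    redirected away from in meta['redirect_urls'], oldest first. When the
    key is missing, no redirect happened and response.url is the original.
    """
    redirect_urls = response.meta.get("redirect_urls")
    return redirect_urls[0] if redirect_urls else response.url


# Hypothetical stand-ins for scrapy responses (only .url and .meta are used).
redirected = SimpleNamespace(
    url="http://abc.com/new-page",
    meta={"redirect_urls": ["http://abc.com/old-page"]},
)
direct = SimpleNamespace(url="http://abc.com/page", meta={})

print(original_url(redirected))  # the URL requested before the redirect
print(original_url(direct))      # no redirect, so response.url itself
```

Inside the spider's parse method this would be used as item['url'] = original_url(response) in place of response.url.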
