Adding Headers to Scrapy Spider


Question

For a project, I am running a large number of Scrapy requests for certain search terms. The requests use the same search terms but different time horizons, as shown by the dates in the URLs below.

Despite the different dates and pages the URLs refer to, I receive the same value as output for every request. It appears the script takes the first value obtained and assigns that same output to all subsequent requests.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2005%2Ccd_max%3A12%2F31%2F2005&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2006%2Ccd_max%3A12%2F31%2F2006&tbm=nws',
    ]

    def parse(self, response):
        item = {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }
        yield item

I have found a thread discussing a similar problem with BeautifulSoup. The solution there was to add headers to the script, making it use a browser User-Agent:

import requests

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

The approach to applying headers in Scrapy seems to be different, though. Does anyone know how best to include them in Scrapy, particularly with reference to start_urls, which contains several URLs at once?
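
For reference, Scrapy lets a spider attach headers to individual requests by overriding start_requests() and passing a headers dict to scrapy.Request. A minimal sketch, assuming the same start_urls and User-Agent string quoted above:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
        # ... remaining yearly URLs as above
    ]

    # headers sent with every request built from start_urls
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    }

    def start_requests(self):
        # parse() stays the same as in the spider above
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)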

Answer

You don't need to modify the headers here. You need to set the user agent, which Scrapy lets you do directly:

import scrapy

class QuotesSpider(scrapy.Spider):
    # ...
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
    # ...

Now you'll get output like:

'results': 'About 357 results', ...
'results': 'About 215 results', ...
'results': 'About 870 results', ...
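
Equivalently, the user agent can be set per spider through custom_settings, or project-wide with the USER_AGENT setting in settings.py. A minimal sketch of the per-spider variant, assuming the same spider as above:

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # per-spider override of the project-wide USER_AGENT setting
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    }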
