Adding Headers to Scrapy Spider


Question

For a project, I am running a large number of Scrapy requests for certain search terms. The requests use the same search terms but different time horizons, as shown by the dates in the URLs below.

Despite the different dates and pages the URLs refer to, I receive the same value as output for every request. It appears the script takes the first value obtained and assigns that same output to all subsequent requests.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2005%2Ccd_max%3A12%2F31%2F2005&tbm=nws',
                  'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2006%2Ccd_max%3A12%2F31%2F2006&tbm=nws',
    ]

    def parse(self, response):
        item = {
            'search_title': response.css('input#sbhost::attr(value)').get(),
            'results': response.css('#resultStats::text').get(),
            'url': response.url,
        }
        yield item

I have found a thread discussing a similar problem with BeautifulSoup. The solution there was to add headers to the script, making it use a browser User-Agent:

import requests

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

The approach to applying headers in Scrapy seems to be different, though. Does anyone know how best to include them in Scrapy, particularly with reference to start_urls, which contains several URLs at once?
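
For reference, Scrapy lets a spider attach headers to individual requests by overriding start_requests() and passing a headers dict to scrapy.Request. A minimal sketch, assuming the same start_urls and User-Agent string quoted above:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=Activision&biw=1280&bih=607&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2004%2Ccd_max%3A12%2F31%2F2004&tbm=nws',
        # ... remaining yearly URLs as above
    ]

    # headers sent with every request built from start_urls
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    }

    def start_requests(self):
        # parse() stays the same as in the spider above
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)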

Answer

You don't need to modify the headers here. You need to set the user agent, which Scrapy lets you do directly:

import scrapy

class QuotesSpider(scrapy.Spider):
    # ...
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
    # ...

Now you'll get output like:

'results': 'About 357 results', ...
'results': 'About 215 results', ...
'results': 'About 870 results', ...
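
Equivalently, the user agent can be set per spider through custom_settings, or project-wide with the USER_AGENT setting in settings.py. A minimal sketch of the per-spider variant, assuming the same spider as above:

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # per-spider override of the project-wide USER_AGENT setting
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    }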
