Scrapy with Splash Only Scrapes 1 Page

Problem Description

I am trying to scrape multiple URLs, but for some reason only results for 1 site show. In every case it is the last URL in start_urls that is shown.

I believe I have the problem narrowed down to my parse function.

Any ideas on what I'm doing wrong?

Thanks!

import scrapy
from scrapy_splash import SplashRequest


class HeatSpider(scrapy.Spider):
    name = "heat"

    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            # Render each page through Splash, waiting 8s for the JS to load.
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 8},
            )

    def parse(self, response):
        # One item per pricing cell on the rendered page.
        for metric in response.css('.matrix-data'):
            yield {
                'City': response.css('title::text').extract_first(),
                'Metric Data Title': metric.css('.title::text').extract_first(),
                'Metric Data Price': metric.css('.price::text').extract_first(),
            }
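(For context, SplashRequest from scrapy-splash relies on the usual wiring in settings.py. Here is a minimal sketch of that setup, following the scrapy-splash README; the Splash host and port are assumptions about the local environment:)

# settings.py: minimal scrapy-splash wiring (a sketch; the Splash URL below
# assumes a local instance, e.g. started with
# `docker run -p 8050:8050 scrapinghub/splash`).
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware dupefilter; worth checking here because these start_urls
# differ only after the '#', and fragment-stripping request fingerprints
# can collapse such URLs into a single request.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'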

I have altered my code to help debug. After running it, my csv has a row for every URL, as there should be, but only one row is filled out with information.

import scrapy
from scrapy_splash import SplashRequest


class HeatSpider(scrapy.Spider):
    name = "heat"

    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 8},
            )

    def parse(self, response):
        # Yield exactly one item per response so every URL produces a csv
        # row, even when the selectors match nothing.
        yield {
            'City': response.css('title::text').extract_first(),
            'Metric Data Title': response.css('.matrix-data .title::text').extract(),
            'Metric Data Price': response.css('.matrix-data .price::text').extract(),
            'url': response.url,
        }
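(For reference, a csv like the one described is typically produced with Scrapy's feed export; one way to run it, where the output filename is an assumption:)

scrapy crawl heat -o output.csv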

EDIT 2: Here is the full output: http://pastebin.com/cLM3T05P. On line 46 you can see the empty cells.

Recommended Answer

What worked for me was adding the delay between the requests:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

DOWNLOAD_DELAY = 5

Tested it on the 4 urls and got the results for all of them:

start_urls = [
    'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=washington&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    'https://www.expedia.com/Hotel-Search?#&destination=philadelphia&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
]
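Putting it together, here is a minimal sketch of the spider with the delay applied per spider via Scrapy's custom_settings instead of the project-wide settings.py (either placement works; the 5-second value is the one tested above):

import scrapy
from scrapy_splash import SplashRequest


class HeatSpider(scrapy.Spider):
    name = "heat"

    # The fix from this answer: wait 5 seconds between consecutive
    # requests to the same site, scoped to this spider only.
    custom_settings = {
        'DOWNLOAD_DELAY': 5,
    }

    start_urls = [
        'https://www.expedia.com/Hotel-Search?#&destination=new+york&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
        'https://www.expedia.com/Hotel-Search?#&destination=dallas&startDate=11/15/2016&endDate=11/16/2016&regionId=&adults=2',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 8},
            )

    def parse(self, response):
        yield {
            'City': response.css('title::text').extract_first(),
            'Metric Data Title': response.css('.matrix-data .title::text').extract(),
            'Metric Data Price': response.css('.matrix-data .price::text').extract(),
            'url': response.url,
        }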
