IMDB web crawler - Scrapy - Python

Problem description

import scrapy
from imdbscrape.items import MovieItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['imdb.com']
    start_urls = ['https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc']

    def parse(self, response):
        urls = response.css('h3.lister-item-header > a::attr(href)').extract()
        for url in urls:
            yield scrapy.Request(url=response.urljoin(url),callback=self.parse_movie)

        nextpg = response.css('div.desc > a::attr(href)').extract_first()
        if nextpg:
            nextpg = response.urljoin(nextpg)
            yield scrapy.Request(url=nextpg,callback=self.parse)

    def parse_movie(self, response):
        item = MovieItem()
        item['title'] = self.getTitle(response)
        item['year'] = self.getYear(response)
        item['rating'] = self.getRating(response)
        item['genre'] = self.getGenre(response)
        item['director'] = self.getDirector(response)
        item['summary'] = self.getSummary(response)
        item['actors'] = self.getActors(response)
        yield item

I wrote the code above to scrape all IMDb movies from 2017 to date, but it only scrapes 100 movies. Please help.

Solution

I believe the issue is with this line:

nextpg = response.css('div.desc > a::attr(href)').extract_first()

On the first page, https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc

the markup for the next-page link is this:

<div class="desc">
    <span class="lister-current-first-item">1</span> to
    <span class="lister-current-last-item">50</span> of 24,842 titles
    <span class="ghost">|</span>
    <a href="?year=2017,2018&amp;title_type=feature&amp;sort=moviemeter,asc&amp;page=2&amp;ref_=adv_nxt" class="lister-page-next next-page" ref-marker="adv_nxt">Next »</a>
</div>

Your code grabs the href of the link with the anchor text "Next »",

which resolves to

https://www.imdb.com/search/title?year=2017,2018&title_type=feature&sort=moviemeter,asc&page=2&ref_=adv_nxt

You go to that page and scrape the next 50 movies.

However, on that second page the div with the class desc has TWO links in it, not one as on the first page.

The first link is the Previous link, not the Next link, so extract_first() sends the spider back to page 1 instead of forward to page 3.

<div class="desc">
    <span class="lister-current-first-item">51</span> to
    <span class="lister-current-last-item">100</span> of 24,842 titles
    <span class="ghost">|</span> <a href="?year=2017,2018&amp;title_type=feature&amp;sort=moviemeter,asc&amp;page=1&amp;ref_=adv_prv" class="lister-page-prev prev-page" ref-marker="adv_nxt">« Previous</a>
    <span class="ghost">|</span> <a href="?year=2017,2018&amp;title_type=feature&amp;sort=moviemeter,asc&amp;page=3&amp;ref_=adv_nxt" class="lister-page-next next-page" ref-marker="adv_nxt">Next »</a>
</div>
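
To see why the selector misbehaves from page 2 onward, here is a small standard-library sketch (not Scrapy itself) that lists the hrefs inside div.desc in document order. The names DescLinkParser, desc_hrefs and PAGE_2_DESC are made up for illustration; the markup is the page-2 snippet above. extract_first() returns the first match in document order, which on this page is the Previous link.

```python
from html.parser import HTMLParser

# Page-2 pagination markup, copied from the snippet above.
PAGE_2_DESC = """
<div class="desc">
    <span class="lister-current-first-item">51</span> to
    <span class="lister-current-last-item">100</span> of 24,842 titles
    <span class="ghost">|</span> <a href="?year=2017,2018&amp;title_type=feature&amp;sort=moviemeter,asc&amp;page=1&amp;ref_=adv_prv" class="lister-page-prev prev-page" ref-marker="adv_nxt">« Previous</a>
    <span class="ghost">|</span> <a href="?year=2017,2018&amp;title_type=feature&amp;sort=moviemeter,asc&amp;page=3&amp;ref_=adv_nxt" class="lister-page-next next-page" ref-marker="adv_nxt">Next »</a>
</div>
"""


class DescLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <div class="desc">."""

    def __init__(self):
        super().__init__()
        self.in_desc = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "desc" in (attrs.get("class") or "").split():
            self.in_desc = True
        elif tag == "a" and self.in_desc:
            self.hrefs.append(attrs.get("href"))

    def handle_endtag(self, tag):
        # Good enough for this snippet, which has no nested divs.
        if tag == "div":
            self.in_desc = False


def desc_hrefs(html):
    """Return the hrefs of the links inside div.desc, in document order."""
    parser = DescLinkParser()
    parser.feed(html)
    return parser.hrefs


hrefs = desc_hrefs(PAGE_2_DESC)
print(hrefs[0])  # the Previous link (page=1), which extract_first() would pick
print(hrefs[1])  # the Next link (page=3), which the spider actually needs
```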

What I would do is set a counter to 0.

Increment it after each successfully scraped results page.

If the counter is greater than 0, grab the second link, go to it, and scrape the results on that page.

If the counter is not greater than 0 (you are still on the first page), grab the first link, go to it, and scrape the results on that page.
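
A minimal sketch of that counter idea, written as a plain function so the link choice can be checked without running a crawl. The function name choose_next_href and the pages_scraped counter are assumptions made for illustration, as is the decision to stop when a later page has fewer than two links (for example a last page with only a Previous link).

```python
def choose_next_href(hrefs, pages_scraped):
    """Return the href of the Next link, or None if there is none.

    hrefs: hrefs of the links inside div.desc, in document order
    pages_scraped: number of results pages parsed before this one
    """
    if pages_scraped == 0:
        # First page: the only link in div.desc is Next.
        return hrefs[0] if hrefs else None
    # Later pages: Previous comes first, Next second.
    # Assumption: fewer than two links means there is no Next page to follow.
    return hrefs[1] if len(hrefs) > 1 else None


# Hypothetical wiring inside MovieSpider.parse (pages_scraped is an assumed
# attribute initialised to 0 on the spider):
#     hrefs = response.css('div.desc > a::attr(href)').extract()
#     nextpg = choose_next_href(hrefs, self.pages_scraped)
#     self.pages_scraped += 1
#     if nextpg:
#         yield scrapy.Request(url=response.urljoin(nextpg), callback=self.parse)

print(choose_next_href(["?page=2"], 0))             # first page: single link
print(choose_next_href(["?page=1", "?page=3"], 1))  # later page: second link
```

Since the Next link carries the class lister-page-next in the markup above, selecting on that class (for example a.lister-page-next::attr(href)) might avoid the counter altogether. That goes beyond the original answer and depends on IMDb's markup staying as shown.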
