IMDB scrapy 获取所有电影数据 [英] IMDB scrapy get all movie data

查看:85
本文介绍了IMDB scrapy 获取所有电影数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在开展一个课堂项目,并试图获取 2016 年之前的所有 IMDB 电影数据(标题、预算等).我采用了 https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py.

I am working on a class project and trying to get all IMDB movie data (titles, budgets. etc.) up until 2016. I adopted the code from https://github.com/alexwhb/IMDB-spider/blob/master/tutorial/spiders/spider.py.

我的想法是:从 i in range(1874,2016)(因为 1874 是 http 上显示的最早年份)://www.imdb.com/year/),将程序定向到相应年份的网站,并从该 url 中获取数据.

My thought is: from i in range(1874,2016) (since 1874 is the earliest year shown on http://www.imdb.com/year/), direct the program to the corresponding year's website, and grab the data from that url.

但问题是,每年的每个页面只显示 50 部电影,所以在抓取了 50 部电影后,我如何才能进入下一页?在每年爬行之后,我如何才能进入下一年?到目前为止,这是我解析 url 部分的代码,但它只能抓取特定年份的 50 部电影.

But the problem is, each page for each year only show 50 movies, so after crawling the 50 movies, how can I move on to the next page? And after crawling each year, how can I move on to next year? This is my code for the parsing url part so far, but it is only able to crawls 50 movies for a particular year.

class tutorialSpider(scrapy.Spider):
    name = "tutorial"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/search/title?year=2014,2014&title_type=feature&sort=moviemeter,asc"] 

    def parse(self, response):
            for sel in response.xpath("//*[@class='results']/tr/td[3]"):
                item = MovieItem()
                item['Title'] = sel.xpath('a/text()').extract()[0]
                item['MianPageUrl']= "http://imdb.com"+sel.xpath('a/@href').extract()[0]
                request = scrapy.Request(item['MianPageUrl'], callback=self.parseMovieDetails)
                request.meta['item'] = item
                yield request

推荐答案

您可以使用 CrawlSpiders 来简化您的任务.正如您将在下面看到的,start_requests 动态生成 URL 列表,而 parse_page 仅提取要抓取的电影.查找和跟踪下一步"链接由 rules 属性完成.

You can use CrawlSpiders to simplify your task. As you'll see below, start_requests dynamically generates the list of URLs while parse_page only extracts the movies to crawl. Finding and following the 'Next' link is done by the rules attribute.

我同意@Padraic Cunningham 的观点,即硬编码值不是一个好主意.我添加了蜘蛛参数,以便您可以调用:scrapy crawl imdb -a start=1950 -a end=1980(如果没有任何参数,scrapy 将默认为 1874-2016).

I agree with @Padraic Cunningham that hard-coding values is not a great idea. I've added spider arguments so that you can call: scrapy crawl imdb -a start=1950 -a end=1980 (the scraper will default to 1874-2016 if it doesn't get any arguments).

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from imdbyear.items import MovieItem

class IMDBSpider(CrawlSpider):
    name = 'imdb'
    rules = (
        # extract links at the bottom of the page. note that there are 'Prev' and 'Next'
        # links, so a bit of additional filtering is needed
        Rule(LinkExtractor(restrict_xpaths=('//*[@id="right"]/span/a')),
            process_links=lambda links: filter(lambda l: 'Next' in l.text, links),
            callback='parse_page',
            follow=True),
    )

    def __init__(self, start=None, end=None, *args, **kwargs):
      super(IMDBSpider, self).__init__(*args, **kwargs)
      self.start_year = int(start) if start else 1874
      self.end_year = int(end) if end else 2016

    # generate start_urls dynamically
    def start_requests(self):
        for year in range(self.start_year, self.end_year+1):
            yield scrapy.Request('http://www.imdb.com/search/title?year=%d,%d&title_type=feature&sort=moviemeter,asc' % (year, year))

    def parse_page(self, response):
        for sel in response.xpath("//*[@class='results']/tr/td[3]"):
            item = MovieItem()
            item['Title'] = sel.xpath('a/text()').extract()[0]
            # note -- you had 'MianPageUrl' as your scrapy field name. I would recommend fixing this typo
            # (you will need to change it in items.py as well)
            item['MainPageUrl']= "http://imdb.com"+sel.xpath('a/@href').extract()[0]
            request = scrapy.Request(item['MainPageUrl'], callback=self.parseMovieDetails)
            request.meta['item'] = item
            yield request
    # make sure that the dynamically generated start_urls are parsed as well
    parse_start_url = parse_page

    # do your magic
    def parseMovieDetails(self, response):
        pass

这篇关于IMDB scrapy 获取所有电影数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆