python scrapy not crawling all urls in scraped list

Problem description

I am trying to scrape information from the pages listed on this page. https://pardo.ch/pardo/program/archive/2017/catalog-films.html

The xpath selector:

film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()

correctly scrapes all 23 urls. However, the spider doesn't even appear to try crawling all 23. It crawls only 11, the same 11 each time. Since I'm using selenium, I can see it jump right over the first page/url without ever navigating to it. What gives?

Here is my code:

from scrapy import Spider
from scrapy.http import Request
from selenium import webdriver
from scrapy.selector import Selector
from time import sleep
from selenium.common.exceptions import NoSuchElementException
from scrapy.loader import ItemLoader
from films_locarno.items import FilmsLocarnoItem


class FilmsLocarnoSpiderSpider(Spider):
    name = 'films_locarno_spider'
    allowed_domains = ['https://pardo.ch/']
    start_urls = ['https://pardo.ch/pardo/program/archive/2017/catalog-films.html']

    def start_requests(self):
        self.driver = webdriver.Firefox()
        self.driver.get('https://pardo.ch/pardo/program/archive/2017/catalog-films.html')
        sel = Selector(text=self.driver.page_source)

        # grab list of start pages for all 4/5 editions of festival available
        # list of film page urls on start page (letter A)
        film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()
        film_page_urls_startpage_full = []
        for url in film_page_urls_startpage:
            film_page_fullurl = "https://pardo.ch" + url
            film_page_urls_startpage_full.append(film_page_fullurl)

        # navigate to startpage film_pages
        for url3 in film_page_urls_startpage_full:
            self.driver.get(url3)
            sel = Selector(text=self.driver.page_source)
            self.logger.info('Sleeping for 1 second')
            sleep(1)
            yield Request(url3, callback=self.parse_filmpage)
            self.logger.info('Sleeping for 2 seconds')
            sleep(2)

My output log reads [you can ignore the ERROR, it's only a page navigation error, since fixed]:

2017-12-26 09:29:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: films_locarno)
2017-12-26 09:29:33 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['films_locarno.spiders'], 'BOT_NAME': 'films_locarno', 'NEWSPIDER_MODULE': 'films_locarno.spiders', 'FEED_URI': 'films_locarno6.csv', 'FEED_FORMAT': 'csv'}
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-26 09:29:33 [scrapy.core.engine] INFO: Spider opened
2017-12-26 09:29:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:29:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-12-26 09:29:34 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session {"capabilities": {"firstMatch": [], "alwaysMatch": {"browserName": "firefox", "acceptInsecureCerts": true}}, "desiredCapabilities": {"browserName": "firefox", "acceptInsecureCerts": true}}
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/catalog-films.html"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:29:57 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:29:59 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:04 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:06 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:10 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:12 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:15 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:17 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:20 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:22 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:26 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:28 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=929220&eid=70"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:33 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:35 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960742&eid=70"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:30:39 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960703&eid=70"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:45 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:47 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=963699&eid=70"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70> (referer: None)
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70> (referer: None)
2017-12-26 09:30:51 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=964462&eid=70"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70> (referer: None)
2017-12-26 09:30:59 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:02 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:05 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch<a href=\"?finit=B\" class=\"dd__list__link\">B</a>"}
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:31:07 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/Users/MNK1/Desktop/films_locarno/films_locarno/spiders/films_locarno_spider.py", line 48, in start_requests
    self.driver.get(films_list_page)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 268, in get
    self.execute(Command.GET, {'url': url})
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Malformed URL: https://pardo.ch<a href="?finit=B" class="dd__list__link">B</a> is not a valid URL.

2017-12-26 09:31:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70> (referer: None)
2017-12-26 09:31:07 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70> (referer: None)
2017-12-26 09:31:10 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:13 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F430%2FOC973705_P3001_240430.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70> (referer: None)
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70>
{'color': ['Color'],
 'country': ['Pakistan, USA'],
 'director': [''],
 'festival_edition': ['70th'],
 'festival_year': ['2017'],
 'film_year': ['2015'],
 'format_': ['DCP'],
 'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'],
 'images': [{'checksum': '89dd9751e436eed7ae35f980c2e10bc3',
             'path': 'full/53cb39b642dcd6cea1e7898c9dc4777b844ea4fd.jpg',
             'url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'}],
 'language': ['Urdu'],
 'length': ["40'"],
 'program': ['Open Doors: Screenings'],
 'title': ['A Girl in the River: The Price of Forgiveness']}
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70>
{'color': ['Color'],
 'country': ['Switzerland'],
 'director': [''],
 'festival_edition': ['70th'],
 'festival_year': ['2017'],
 'film_year': ['2017'],
 'format_': ['DCP'],
 'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'],
 'images': [{'checksum': 'cce5e9ffd3bad2b359c489ac4c51c25e',
             'path': 'full/84e0d100fc90acf2c0cfe8c38454a305e23b7408.jpg',
             'url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'}],

[[edited for length]]


2017-12-26 09:31:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3038,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 115519,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 11,
 'file_count': 11,
 'file_status_count/uptodate': 11,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 12, 26, 17, 31, 35, 820684),
 'item_scraped_count': 11,
 'log_count/DEBUG': 86,
 'log_count/ERROR': 1,
 'log_count/INFO': 43,
 'memusage/max': 79556608,
 'memusage/startup': 66007040,
 'response_received_count': 11,
 'scheduler/dequeued': 11,
 'scheduler/dequeued/memory': 11,
 'scheduler/enqueued': 11,
 'scheduler/enqueued/memory': 11,
 'start_time': datetime.datetime(2017, 12, 26, 17, 29, 33, 860768)}
2017-12-26 09:31:35 [scrapy.core.engine] INFO: Spider closed (finished)

Recommended answer

I checked

len(film_page_urls_startpage)

and I get only 11, not 23.

If I use xpath('//article/a/@href') then I get 23 urls.

There is no need to add @class: there are no other article elements on the page.

If I do

for item in sel.xpath('//article/@class').extract():
    print('class:', item)

then I get

class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even

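So some items have "even" at the end of the class string, and this was your problem. The predicate @class="..." compares the whole attribute value, so it matches only the 11 articles whose class string has no "even" and skips the other 12.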
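The effect can be reproduced without a browser or network access. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, and the five-article markup and /filmN.html links are made up for the example. Only the class string is taken from the question.

```python
import xml.etree.ElementTree as ET

# Class string from the question; every other article also carries " even".
BASE_CLASS = "strip-list_link_all strip-list strip--color row row--5"

# Hypothetical markup: 5 articles, alternating with and without "even".
articles = "".join(
    '<article class="{}{}"><a href="/film{}.html">film</a></article>'.format(
        BASE_CLASS, " even" if i % 2 == 0 else "", i
    )
    for i in range(5)
)
root = ET.fromstring("<div>" + articles + "</div>")

# Exact comparison on @class: only articles whose class string is identical match.
exact = [
    a.get("href")
    for a in root.findall(".//article[@class='%s']/a" % BASE_CLASS)
]

# No class predicate: every article link matches.
all_links = [a.get("href") for a in root.findall(".//article/a")]

# Token-based filter: test for one class name instead of the whole string.
by_token = [
    a.get("href")
    for art in root.iter("article")
    if "strip-list" in art.get("class", "").split()
    for a in art.findall("a")
]

print(len(exact), len(all_links), len(by_token))
```

In the spider itself, the simplest fix is the //article/a/@href expression shown above. If a class filter is still wanted, an XPath 1.0 predicate such as contains(@class, "strip-list") should match both variants, since it tests for a substring rather than the full attribute value.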
