Scrapy: scraping a list of links


Problem description

This question is something of a follow-up to a question that I asked previously.

I am trying to scrape a website that contains some links on its first page, something similar to this.

Since I want to scrape the details of the items shown on that page, I have extracted their individual URLs.

I have saved these URLs in a list.

How do I launch spiders to scrape the pages individually?

For a better understanding:

[urlA, urlB, urlC, urlD...]

This is the list of URLs that I have scraped. Now I want to launch a spider to scrape these links individually.

How do I do that?

Recommended answer

I'm assuming that the URLs you want to follow lead to pages with the same or similar structure. If that's the case, you should do something like this:

from scrapy.contrib.spiders import CrawlSpider  # in Scrapy >= 1.0 this lives in scrapy.spiders
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):

    name = 'yourCrawler'
    allowed_domains = ['domain.com']  # must be a list of domains
    start_urls = ["http://www.domain.com/example/url"]

    def parse(self, response):
        # Parse any elements you need from the start_urls and, optionally,
        # store them as Items. See http://doc.scrapy.org/en/latest/topics/items.html
        s = Selector(response)
        urls = s.xpath('//div[@id="example"]//a/@href').extract()
        for url in urls:
            # Send a separate request for every extracted link.
            yield Request(url, callback=self.parse_following_urls, dont_filter=True)

    def parse_following_urls(self, response):
        # Parsing rules for the followed pages go here.
        pass
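
As a rough sketch of what parse_following_urls could look like, here is a hypothetical version that pulls a couple of fields from each item page and yields them as a dict. The XPath expressions and field names are made up for illustration, and yielding plain dicts together with extract_first() assumes Scrapy 1.0 or later:

    def parse_following_urls(self, response):
        # Hypothetical selectors -- adjust them to the real page structure.
        s = Selector(response)
        yield {
            'title': s.xpath('//h1/text()').extract_first(),
            'price': s.xpath('//span[@class="price"]/text()').extract_first(),
        }

With the spider saved inside a Scrapy project, you would then run it with the standard command-line tool, e.g. scrapy crawl yourCrawler -o items.json, which writes the scraped items to a JSON file.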

Otherwise, if the URLs you want to follow lead to pages with different structures, you can define specific methods for them (something like parse1, parse2, parse3, ...).
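
One common way to do that is to pick the callback at the moment the Request is yielded, based on something in the URL. This is only a sketch; the URL substrings and method names below are hypothetical:

    def parse(self, response):
        s = Selector(response)
        for url in s.xpath('//div[@id="example"]//a/@href').extract():
            # Route each link to the callback that matches its page type.
            # The '/books/' and '/authors/' substrings are hypothetical examples.
            if '/books/' in url:
                yield Request(url, callback=self.parse_book, dont_filter=True)
            elif '/authors/' in url:
                yield Request(url, callback=self.parse_author, dont_filter=True)
            else:
                yield Request(url, callback=self.parse_following_urls, dont_filter=True)

    def parse_book(self, response):
        # Parsing rules for one page layout go here.
        pass

    def parse_author(self, response):
        # Parsing rules for another page layout go here.
        pass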
