scrapy - parsing items that are paginated


Problem description

I have URLs of the form:

example.com/foo/bar/page_1.html

There are a total of 53 pages, each one of them has ~20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page, and also goes one page deeper per item to get more info about the item:

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc


  def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # some items don't have a category associated with them
      try:
        item['category'] = rest.select('td[3]/a/text()').extract()[0]
      except IndexError:
        item['category'] = ''
      item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

      # get profile url
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      # join with base url since profile url is relative
      base_url = get_base_url(response)
      follow = urljoin_rfc(base_url, rel_url)

      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      return request


  def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html

Answer

You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
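
As a sketch, here is the parse method from the question rewritten to yield, trimmed to the pagination-relevant lines (the item fields, XPaths, and helpers are the ones from the question; the other fields are filled in as before):

  def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]

      rel_url = rest.select('td[2]/a/@href').extract()[0]
      follow = urljoin_rfc(get_base_url(response), rel_url)

      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      # yield, don't return: the loop keeps running and one
      # request is emitted per row instead of just the first
      yield request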

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    # xrange(1, 54) yields 1..53, matching page_1.html through page_53.html
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
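
With start_urls built this way, Scrapy calls parse once per page, so combined with yield in the callback you end up with all ~53*20 items. Note that BaseSpider and xrange date the snippet to Python 2-era Scrapy; on current versions the equivalents are scrapy.Spider and range.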
