Scrapy Recursive download of Content


Question

After banging my head several times, I am finally coming here.

Problem: I am trying to download the content of each Craigslist posting. By content I mean the "posting body", i.e. the description of the cell phone. (Looking for a new old phone, since the iPhone is done with all the excitement.)

The code is an awesome work by Michael Herman.

My spider class:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craig.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    # Follow only the pagination ("next page") links
    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

And the Item class:

from scrapy.item import Item, Field

class CraiglistSampleItem(Item):
    title = Field()
    link = Field()

Since the code will traverse many links, I wanted to save the description of each cell phone in a separate csv, but one more column in the csv would be fine too.

Any clues!!!

Answer

Instead of returning items in the parse_items method, you should return/yield a Scrapy Request instance in order to get the description from the item page; the link and title can be passed inside an Item, and the Item inside the meta dictionary:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field


class CraiglistSampleItem(Item):
    title = Field()
    link = Field()
    description = Field()


class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)

        titles = hxs.select("//span[@class='pl']")
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()[0]
            item["link"] = title.select("a/@href").extract()[0]

            # Follow the posting link and carry the partially filled item
            # along in the request's meta dictionary
            url = "http://minneapolis.craigslist.org%s" % item["link"]
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        # Retrieve the item passed via meta and fill in the posting body
        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

Run it and see the additional description column in your output csv file.
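
If you are not already exporting through a pipeline, Scrapy's built-in feed exporter can write the csv directly from the command line; the filename here is just an example, and the -t csv flag can be dropped on newer versions where the format is inferred from the extension:

scrapy crawl craigs -o items.csv -t csv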

Hope that helps.

