Make Scrapy follow links and collect data


Question

I am trying to write a program in Scrapy to open links and collect data from this tag: <p class="attrgroup"></p>.

I've managed to make Scrapy collect all the links from a given URL, but not to follow them. Any help is very much appreciated.

Answer

You need to yield Request instances for the links you want to follow, assign a callback, and extract the text of the desired p element inside that callback:

# -*- coding: utf-8 -*-
import scrapy


# item class included here 
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://chicago.craigslist.org/search/emd?"
    ]

    BASE_URL = 'http://chicago.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        item = DmozItem()
        item["link"] = response.url
        item["attr"] = "".join(response.xpath("//p[@class='attrgroup']//text()").extract())
        return item
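One caveat: concatenating BASE_URL with a root-relative href (e.g. /emd/...) produces a double slash in the resulting URL. Most servers tolerate this, but a safer approach is to let a URL resolver join the pieces; in a spider you could use response.urljoin(link), which wraps the standard library's urljoin. A minimal sketch with hypothetical values:

```python
from urllib.parse import urljoin

base = "http://chicago.craigslist.org/"
href = "/emd/1234.html"  # hypothetical root-relative link from the listing page

# Naive concatenation keeps both slashes:
print(base + href)           # http://chicago.craigslist.org//emd/1234.html

# urljoin resolves the path correctly:
print(urljoin(base, href))   # http://chicago.craigslist.org/emd/1234.html
```

With that change, the parse method no longer needs the BASE_URL class attribute at all. You can then run the spider with something like scrapy runspider dmoz_spider.py -o items.json to dump the collected items.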
