Recursive Scraping Craigslist with Scrapy and Python 2.7

Problem description

I'm having trouble getting the spider to follow the next page of ads without following every link it finds, which eventually returns every Craigslist page. I've played around with the rule, since I know that's where the problem lies, but I either get just the first page, every page on Craigslist, or nothing. Any help?

Here is my current code:

from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class PageSpider(CrawlSpider):
    name = "cto"
    allowed_domains = ["medford.craigslist.org"]
    start_urls = ["http://medford.craigslist.org/cto/"]

    rules = (
        Rule(
            SgmlLinkExtractor(allow_domains=("medford.craigslist.org", )),
            callback='parse_page', follow=True
        ),
    )

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//div[@class="content"]/p[@class="row"]')

        for row in rows:
            item = CraigslistSampleItem()
            link = row.xpath('.//span[@class="pl"]/a')
            item['title'] = link.xpath("text()").extract()
            item['link'] = link.xpath("@href").extract()
            item['price'] = row.xpath('.//span[@class="l2"]/span[@class="price"]/text()').extract()

            url = 'http://medford.craigslist.org{}'.format(''.join(item['link']))
            yield Request(url=url, meta={'item': item}, callback=self.parse_item_page)


    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item

Recommended answer

You should specify the allow argument of SgmlLinkExtractor:

allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
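
For intuition, the extractor keeps a link when the pattern is found anywhere in the absolute URL, which you can illustrate with a plain re.search — a minimal sketch for intuition, not Scrapy's actual internals:

import re

# The pattern used in the rule below; the unescaped dots happen to
# match the literal dots in the domain as well.
pattern = r'http://medford.craigslist.org/cto/'

print(bool(re.search(pattern, 'http://medford.craigslist.org/cto/index100.html')))  # True
print(bool(re.search(pattern, 'http://medford.craigslist.org/about/')))             # False

Applied to the spider, the rule becomes: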

rules = (
    Rule(SgmlLinkExtractor(allow='http://medford.craigslist.org/cto/'), 
         callback='parse_page', follow=True),
)

This will keep the crawl to links under the http://medford.craigslist.org/cto/ URL.
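
If you want to follow only the pagination ("next page") links rather than every listing under /cto/, the regular expression can be tightened further. A minimal sketch, assuming Craigslist's result pages were numbered as /cto/indexNNN.html at the time — verify against the real next-page URLs before relying on it:

rules = (
    # Follow only numbered result pages such as /cto/index100.html.
    # NOTE: the index\d+ pagination scheme is an assumption -- check the
    # actual "next page" href in the page source.
    Rule(SgmlLinkExtractor(allow=r'medford\.craigslist\.org/cto/index\d+\.html'),
         callback='parse_page', follow=True),
)

Since parse_page already issues explicit Requests for the individual ad pages, restricting the rule to pagination keeps the crawl from wandering into unrelated sections. Note that CrawlSpider does not run the rule callback on the start URL itself, so you may also want to override parse_start_url (or keep the broader pattern) to cover the first page of results.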

Hope it helps.
