Scrapy Python Craigslist Scraper

Problem description

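I am trying to scrape Craigslist classifieds using Scrapy to extract items that are for sale.

I am able to extract the date, post title, and post URL, but I am having trouble extracting the price.

For some reason the current code extracts all of the prices for every item, but when I remove the // before the price span lookup, the price field comes back empty.

Can someone please review the code below and help me out?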

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://longisland.craigslist.org/search/sss?sort=date&query=raptor%20660&srchType=T"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item['date'] = titles.select('span[@class="itemdate"]/text()').extract()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            item ['price'] = titles.select('//span[@class="itempp"]/text()').extract()
            items.append(item)
        return items

Solution

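itempp appears to be nested inside another element, itempnr. A leading // searches the whole document rather than the current <p>, which is why every item ends up with all of the prices; without it, span[@class="itempp"] only matches direct children of the <p>, so nothing is found. It may work if you change //span[@class="itempp"]/text() to span[@class="itempnr"]/span[@class="itempp"]/text().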
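The difference between the three forms of the path can be illustrated with a small standalone sketch. It uses Python's standard xml.etree.ElementTree rather than Scrapy, and the markup is a hypothetical sample built only from the class names mentioned above, since the real page structure is not shown in the question. ElementTree does not accept an absolute // on an element, so the "whole document" case is approximated by searching from the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the class names in the question and answer;
# the actual Craigslist page structure is an assumption.
SAMPLE = """
<div>
  <p class="row">
    <span class="itemdate">Jan 1</span>
    <a href="/one.html">Raptor 660 A</a>
    <span class="itempnr"><span class="itempp">$2500</span></span>
  </p>
  <p class="row">
    <span class="itemdate">Jan 2</span>
    <a href="/two.html">Raptor 660 B</a>
    <span class="itempnr"><span class="itempp">$3000</span></span>
  </p>
</div>
"""

root = ET.fromstring(SAMPLE)
rows = root.findall(".//p")

# Searching from the document root (the effect of a leading //) finds every price.
all_prices = [s.text for s in root.findall(".//span[@class='itempp']")]

# A bare child step only inspects direct children of each <p>, so it finds nothing.
direct_child = [[s.text for s in row.findall("span[@class='itempp']")] for row in rows]

# Spelling out the nesting yields exactly one price per row.
nested = [
    [s.text for s in row.findall("span[@class='itempnr']/span[@class='itempp']")]
    for row in rows
]

# A path starting with .// searches descendants of the current row only.
relative = [[s.text for s in row.findall(".//span[@class='itempp']")] for row in rows]

print(all_prices)
print(direct_child)
print(nested)
print(relative)
```

In the spider itself, the relative form would presumably be written as .//span[@class="itempp"]/text(), which keeps working even if the intermediate itempnr wrapper changes.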
