Scrapy Python Craigslist Scraper
Problem description
I am trying to scrape Craigslist classifieds with Scrapy to extract items that are for sale.
I am able to extract the date, post title, and post URL, but I am having trouble extracting the price.
For some reason the current code extracts all of the prices, but when I remove the // before the price span lookup, the price field comes back empty.
Can someone please review the code below and help me out?
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://longisland.craigslist.org/search/sss?sort=date&query=raptor%20660&srchType=T"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["date"] = title.select('span[@class="itemdate"]/text()').extract()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            item["price"] = title.select('//span[@class="itempp"]/text()').extract()
            items.append(item)
        return items
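itempp appears to be inside another element, itempnr, so it is not a direct child of the p element; that would explain why the lookup returns nothing once the // is removed. With the leading //, the XPath searches the whole document rather than the current p, which is why every item gets all of the prices. Perhaps it would work if you changed //span[@class="itempp"]/text() to span[@class="itempnr"]/span[@class="itempp"]/text().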
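To see the difference between the two lookups without running a crawl, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors. The markup in SAMPLE is hypothetical: the real Craigslist page is not shown in the question, so it only mirrors the nesting described above (itempp inside itempnr inside each p). ElementTree supports only a subset of XPath, so the document-wide search is written as .// from the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real Craigslist HTML):
# each listing is a <p>, and the price span sits inside span.itempnr.
SAMPLE = """
<div>
  <p>
    <span class="itemdate">Jan 1</span>
    <a href="/a.html">Raptor 660 A</a>
    <span class="itempnr"><span class="itempp">$100</span></span>
  </p>
  <p>
    <span class="itemdate">Jan 2</span>
    <a href="/b.html">Raptor 660 B</a>
    <span class="itempnr"><span class="itempp">$200</span></span>
  </p>
</div>
"""

root = ET.fromstring(SAMPLE)

# Searching from the document root finds every price on the page,
# no matter which listing is currently being processed.
all_prices = [s.text for s in root.findall(".//span[@class='itempp']")]

rows = []
for p in root.findall("p"):
    # Direct-child lookup: empty, because itempp is nested inside itempnr.
    direct = [s.text for s in p.findall("span[@class='itempp']")]
    # Full relative path: reaches only the price that belongs to this listing.
    nested = [
        s.text
        for s in p.findall("span[@class='itempnr']/span[@class='itempp']")
    ]
    rows.append({"title": p.find("a").text, "direct": direct, "price": nested})

print(all_prices)
for row in rows:
    print(row)
```

In Scrapy itself, prefixing the expression with a dot, as in .//span[@class="itempp"]/text(), should also keep the search relative to the current p, which avoids spelling out the intermediate itempnr element.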