Scraping links with Scrapy
Question
I am trying to scrape a Swedish real estate website, www.booli.se. However, I can't figure out how to follow the link for each house and extract, for example, price, rooms, age, etc. I only know how to scrape a single page, and I can't seem to wrap my head around this. I am looking to do something like:
for link in website:
    follow link
    attribute1 = item.css('cssobject::text').extract()[1]
    attribute2 = item.css('cssobject::text').extract()[2]
    yield {'Attribute 1': attribute1, 'Attribute 2': attribute2}
That way I can scrape the data and output it to an Excel file. My code for scraping a single page without following links is as follows:
import scrapy


class BooliSpider(scrapy.Spider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/'
    ]

    '''def parse(self, response):
        for link in response.css('.nav-list a::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.collect_data)'''

    def parse(self, response):
        for item in response.css('li.search-list__item'):
            size = item.css('span.search-list__row::text').extract()[1]
            price = item.css('span.search-list__row::text').extract()[3]
            m2price = item.css('span.search-list__row::text').extract()[4]
            yield {'Size': size, 'Price': price, 'M2price': m2price}
Thankful for any help. I am really having trouble putting it all together and writing the contents of specific links to one cohesive output file (Excel).
Recommended answer
You could use Scrapy's CrawlSpider to follow and scrape links.
Your code should look like this:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooliItem(scrapy.Item):
    size = scrapy.Field()
    price = scrapy.Field()
    m2price = scrapy.Field()


class BooliSpider(CrawlSpider):
    name = "boolidata"
    start_urls = [
        'https://www.booli.se/slutpriser/lund/116978/',
    ]

    # Each Rule tells the crawler which links to follow and which callback
    # handles the pages those links lead to.
    rules = [
        Rule(
            LinkExtractor(
                allow=(r'listing url pattern here to follow'),
                deny=(r'other url patterns to deny'),
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    # Do not name this method "parse": CrawlSpider uses parse internally.
    def parse_item(self, response):
        item = BooliItem()
        item['size'] = response.css('size selector').extract()
        item['price'] = response.css('price selector').extract()
        item['m2price'] = response.css('m2price selector').extract()
        return item
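The allow and deny arguments above are regular expressions that are tested against each link's absolute URL. As a rough stand-in for that filtering, using only the standard library, the sketch below keeps a URL when it matches at least one allow pattern and no deny pattern. The patterns and the example.com URLs are hypothetical, since the real URL layout of the site is not given here, and would_follow is an illustrative helper, not part of Scrapy.

```python
import re

# Hypothetical patterns: the site's real URL layout is not given in the answer.
ALLOW = [re.compile(r"/listing/\d+")]
DENY = [re.compile(r"/listing/\d+/photos")]


def would_follow(url, allow=ALLOW, deny=DENY):
    """Approximate the allow/deny check: match any allow, match no deny."""
    if allow and not any(p.search(url) for p in allow):
        return False
    return not any(p.search(url) for p in deny)


urls = [
    "https://example.com/listing/123",
    "https://example.com/listing/123/photos",
    "https://example.com/about",
]
# Keep only the URLs that the rule would hand to the callback.
followed = [u for u in urls if would_follow(u)]
print(followed)
```

Trying candidate patterns against a handful of real listing URLs in this way can be quicker than re-running the whole crawl.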
You can run the spider with the command below; the name after crawl must match the spider's name attribute:
scrapy crawl boolidata -o booli.csv
Then import the CSV file into Excel.