start_urls in Scrapy

Question

I am trying to fetch some information from this website: http://www.go-on.fi/tyopaikat. As you can see, the table is paginated, so whenever you click the second or third page, the link changes to something like http://www.go-on.fi/tyopaikat?start=20 (with "start=" at the end). This is my spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from myproject.items import JobData  # hypothetical items module; JobData is assumed to be defined there


class JobSpider(BaseSpider):
    name = "jobs"  # spider name added so the snippet runs as-is
    allowed_domains = ["go-on.fi"]
    start_urls = ["http://www.go-on.fi/tyopaikat?start=0"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        titles = hxs.select("//tr")

        for row in titles:
            item = JobData()
            item['header'] = row.select("./td[1]/a/text()").extract()
            item['link'] = row.select("./td[1]/a/@href").extract()
            items.append(item)

        return items  # parse() has to return (or yield) the scraped items

So my question is, how can I make the spider go through every page of the table?

Answer

What you could do is set start_urls to the main page, then, based on the number of pages shown in the footer pagination (in this case 3), use a loop to yield a Request for each page:

from scrapy.http import Request  # needed for the yielded Request

allowed_domains = ["go-on.fi"]
start_urls = ["http://www.go-on.fi/tyopaikat"]

def parse(self, response):
    # Read the highest page number from the footer pagination: the
    # last-but-one <li> holds it (the last one is the "next" arrow).
    pages = int(response.xpath('//ul[@class="pagination"]/li[last()-1]/a/text()').extract()[0])
    page = 1
    start = 0
    while page <= pages:
        url = "http://www.go-on.fi/tyopaikat?start=" + str(start)
        start += 20
        page += 1
        yield Request(url, callback=self.parse_page)

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    titles = hxs.select("//tr")

    for row in titles:
        item = JobData()
        item['header'] = row.select("./td[1]/a/text()").extract()
        item['link'] = row.select("./td[1]/a/@href").extract()
        items.append(item)

    return items  # hand the items from each page back to Scrapy
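
For reference, the same crawl can also be written without counting the pages up front: keep following the next start= offset while the current page still yields rows. Below is a minimal sketch assuming a recent Scrapy version (response.xpath with .get(), dict items); the spider name, the row XPath, and the page size of 20 are assumptions, the 20 being taken from the ?start=20 pattern in the question:

import scrapy


class JobSpider(scrapy.Spider):
    name = "go-on"  # spider name is an assumption
    allowed_domains = ["go-on.fi"]
    start_urls = ["http://www.go-on.fi/tyopaikat?start=0"]
    page_size = 20  # assumed from the ?start=20 URL pattern

    def parse(self, response):
        # Only rows whose first cell contains a link are treated as listings.
        rows = response.xpath("//tr[td[1]/a]")
        for row in rows:
            yield {
                "header": row.xpath("./td[1]/a/text()").get(),
                "link": row.xpath("./td[1]/a/@href").get(),
            }
        # Follow the next offset only while the current page still has rows,
        # so the spider stops on its own without reading the page count.
        if rows:
            start = int(response.url.split("start=")[-1]) + self.page_size
            yield scrapy.Request(
                "http://www.go-on.fi/tyopaikat?start=%d" % start,
                callback=self.parse,
            )

The advantage of this variant is that the number of pages is not hard-coded, so the spider keeps working if the listings grow beyond three pages.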
