How to use Scrapy to crawl multiple pages? (two levels)

Question

On my site I created two simple pages. Here is the HTML of each:

test1.html:

<html>
<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, "C", "1", "Product", "N");" indepth="true">
<span>cool</span></a>
</body></html>

test2.html:

<html>
<head>
<title>test2</title>
</head>
<body></body></html>

I want to scrape the text in the title tag of both pages, that is "test1" and "test2". I am new to Scrapy, and so far I only manage to scrape the first page. My Scrapy script:

from scrapy import Spider
from scrapy.selector import Selector

from testscrapy1.items import Website


class DmozSpider(Spider):
    name = "bill"
    # allowed_domains takes bare domain names, not URLs.
    allowed_domains = ["exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []

        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)

        return items

How do I get past the onclick? And how do I scrape the text of the title tag of the second page? Thank you in advance, STEF

Recommended answer

To use several functions in your spider, sending multiple requests and parsing each response, you need two things: 1) yield instead of return, and 2) a callback on each request.

Example:

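import scrapy

# Both functions are methods of the spider class.
def parse(self, response):
    for site in response.xpath('//head'):
        item = Website()
        item['title'] = site.xpath('.//title/text()').extract()
        yield item
    # Schedule a second request; its response is passed to other_function.
    yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)

def other_function(self, response):
    # '//this_xpath' and './/this/and/that' are placeholder expressions.
    for other_thing in response.xpath('//this_xpath'):
        item = Website()
        item['title'] = other_thing.xpath('.//this/and/that').extract()
        yield item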

Scrapy does not execute JavaScript, so it cannot run the onclick handler. You can, however, work out what the JavaScript does and reproduce it in the spider: http://doc.scrapy.org/en/latest/topics/firebug.html. In this case the link target is already in the href attribute, so the spider can follow that directly.
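
If the values embedded in an onclick handler were ever needed, one option is to treat the handler as plain text. The sketch below uses only the standard library; the helper name `quoted_args` is hypothetical, and the pattern only handles simple double-quoted string arguments like those in the question's link.

```python
import re

# The onclick value from test1.html in the question, as a Python string.
onclick = 'javascript:return xt_click(this, "C", "1", "Product", "N");'


def quoted_args(js_call):
    """Return the double-quoted string arguments found in a JavaScript call."""
    return re.findall(r'"([^"]*)"', js_call)


args = quoted_args(onclick)
print(args)  # ['C', '1', 'Product', 'N']
```

Inside a spider, the attribute text itself could be read with an XPath such as `//a/@onclick` before passing it to a helper like this.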
