How to use Scrapy to crawl multiple pages? (two levels)
Problem description
On my site I created two simple pages. Here is the HTML of each:
test1.html:
<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, "C", "1", "Product", "N");" indepth="true">
<span>cool</span></a>
</body></html>
test2.html:
<head>
<title>test2</title>
</head>
<body></body></html>
I want to scrape the text of the title tag on both pages, here "test1" and "test2". I am new to Scrapy and have only managed to scrape the first page. My Scrapy script:
from scrapy.spider import Spider
from scrapy.selector import Selector
from testscrapy1.items import Website

class DmozSpider(Spider):
    name = "bill"
    allowed_domains = ["http://exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []
        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)
        return items
How do I get past the onclick, and how do I scrape the text of the title tag on the second page? Thank you in advance. STEF
Recommended answer
To use multiple functions in your code, send multiple requests, and parse each response, you need two things: 1) yield instead of return, and 2) a callback.
Example:
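import scrapy
from testscrapy1.items import Website

class DmozSpider(scrapy.Spider):
    # name, allowed_domains and start_urls as in the question

    def parse(self, response):
        for site in response.xpath('//head'):
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            yield item
        # Request another page; Scrapy passes its response to other_function.
        yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)

    def other_function(self, response):
        # '//this_xpath' and '//this/and/that' are placeholders for your own XPaths.
        for other_thing in response.xpath('//this_xpath'):
            item = Website()
            item['title'] = other_thing.xpath('//this/and/that').extract()
            yield item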
Scrapy does not execute JavaScript, but you can work out what the JavaScript does and reproduce it in your spider: http://doc.scrapy.org/en/latest/topics/firebug.html
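For the link in test1.html this may be simpler than it looks. The `<a>` tag carries a plain `href="test2.html"`, so, assuming `xt_click` only records the click and does not change the navigation target, the onclick handler can be ignored and the href followed directly. The sketch below uses only the standard library to show that step in isolation. `LinkCollector` and `second_level_urls` are hypothetical names, and the embedded HTML is a copy of test1.html with the inner quotes of the onclick attribute written as single quotes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed copy of test1.html; the onclick arguments use single quotes here.
TEST1_HTML = """<html><head><title>test1</title></head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, 'C', '1', 'Product', 'N');" indepth="true">
<span>cool</span></a>
</body></html>"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, ignoring onclick handlers."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def second_level_urls(page_url, html):
    """Return absolute URLs for all links found in the given page."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]


print(second_level_urls("http://www.exemple.com/test1.html", TEST1_HTML))
```

In a spider, the same resolution is what `response.urljoin(href)` does before the URL is handed to `scrapy.Request`.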