Scrapy crawl and follow links within href
Question
I am very new to Scrapy. I need to follow the hrefs from the homepage of a URL to multiple depths. Inside those href links there are again multiple hrefs, and I need to follow them until I reach the page I want to scrape. The sample HTML of my pages is:
Initial Page
<div class="page-categories">
<a class="menu" href="/abc.html">
<a class="menu" href="/def.html">
</div>
Inside abc.html
<div class="cell category" >
<div class="cell-text category">
<p class="t">
<a id="cat-24887" href="fgh.html"/>
</p>
</div>
I need to scrape the contents from this fgh.html page. Could anyone please suggest where to start? I read about LinkExtractors but could not find a suitable reference to begin with. Thank you.
From what I see, I can say that:
- URLs to product categories always end with .kat
- URLs to products contain id_ followed by a set of digits
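Those two patterns can be sanity-checked with plain `re` before wiring them into the spider's rules (the sample URLs below are invented for illustration):

```python
import re

# Patterns mirroring the two observations above
category = re.compile(r'\.kat$')    # category URLs end with ".kat"
product = re.compile(r'/id_\d+/')   # product URLs contain "id_" plus digits

# Hypothetical URLs, just to exercise the patterns
print(bool(category.search('http://www.codecheck.info/essen/getraenke.kat')))  # True
print(bool(product.search('http://www.codecheck.info/produkt/id_12345/')))     # True
print(bool(product.search('http://www.codecheck.info/essen/getraenke.kat')))   # False
```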
Let's use this information to define our spider rules:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
# Note: in Scrapy 1.0+ these moved to scrapy.spiders and scrapy.linkextractors

class CodeCheckspider(CrawlSpider):
    name = "code_check"
    allowed_domains = ["www.codecheck.info"]
    start_urls = ['http://www.codecheck.info/']

    rules = [
        # Follow category links (URLs ending in ".kat") without a callback
        Rule(LinkExtractor(allow=r'\.kat$'), follow=True),
        # Hand product links ("id_" plus digits) to parse_product
        Rule(LinkExtractor(allow=r'/id_\d+/'), callback='parse_product'),
    ]

    def parse_product(self, response):
        title = response.xpath('//title/text()').extract()[0]
        print(title)
In other words, we are asking the spider to follow every category link and to let us know whenever it crawls a link containing id_ - which for us means we have found a product. In this case, for the sake of an example, I'm printing the page title to the console. This should give you a good starting point.
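For intuition, the rule matching can be mimicked in plain Python: the spider checks each extracted link against the rules in order, and the first match decides whether to follow the link and which callback, if any, to invoke. This is only an illustrative sketch with invented names, not Scrapy's actual implementation:

```python
import re

# (pattern, follow, callback name) - mirrors the spider's rules list;
# a Rule given a callback but no explicit follow= defaults to follow=False
RULES = [
    (re.compile(r'\.kat$'), True, None),
    (re.compile(r'/id_\d+/'), False, 'parse_product'),
]

def classify(url):
    """Return (follow, callback) for the first rule matching the URL."""
    for pattern, follow, callback in RULES:
        if pattern.search(url):
            return follow, callback
    return False, None  # unmatched links are ignored

print(classify('http://www.codecheck.info/getraenke.kat'))  # (True, None)
print(classify('http://www.codecheck.info/x/id_987/'))      # (False, 'parse_product')
```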