Scrapy crawl and follow links within href


Question


I am very new to Scrapy. I need to follow hrefs from the homepage of a URL to multiple depths. Inside those href links there are again multiple hrefs, and I need to keep following them until I reach the page I want to scrape. The sample HTML of my pages is:

Initial Page

<div class="page-categories">
 <a class="menu"  href="/abc.html">
 <a class="menu"  href="/def.html">
</div>

Inside abc.html

<div class="cell category" >
 <div class="cell-text category">
 <p class="t">
  <a id="cat-24887" href="fgh.html"/>
</p>
</div>

I need to scrape the contents of this fgh.html page. Could anyone please suggest where to start? I read about link extractors but could not find a suitable reference to begin with. Thank you.

Solution

From what I see, I can say that:

  • URLs to product categories always end with .kat
  • URLs to products contain id_ followed by a set of digits
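As a quick sanity check, the two URL patterns above can be tried against a few hypothetical URLs before wiring them into the spider (the sample paths below are made up for illustration, not taken from the real site):

```python
import re

# The same patterns used in the spider rules.
CATEGORY_RE = re.compile(r'\.kat$')   # category URLs end with .kat
PRODUCT_RE = re.compile(r'/id_\d+/')  # product URLs contain id_ plus digits

# Hypothetical example URLs (assumptions for illustration only).
urls = [
    "http://www.codecheck.info/essen/suesswaren.kat",  # category
    "http://www.codecheck.info/product/id_1234567/",   # product
    "http://www.codecheck.info/about.html",            # neither
]

for url in urls:
    if CATEGORY_RE.search(url):
        print(url, "-> category (follow)")
    elif PRODUCT_RE.search(url):
        print(url, "-> product (parse_product)")
    else:
        print(url, "-> ignored")
```

Any real URL that should be followed or parsed but lands in the "ignored" bucket means the patterns need adjusting for that site.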

Let's use this information to define our spider rules:

# Note: the original answer used scrapy.contrib, which was removed in
# modern Scrapy; the imports below are the current module paths.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class CodeCheckspider(CrawlSpider):
    name = "code_check"

    allowed_domains = ["www.codecheck.info"]
    start_urls = ['http://www.codecheck.info/']

    rules = [
        # Follow every category link (URLs ending in .kat).
        Rule(LinkExtractor(allow=r'\.kat$'), follow=True),
        # Product URLs contain id_ followed by digits; parse those pages.
        Rule(LinkExtractor(allow=r'/id_\d+/'), callback='parse_product'),
    ]

    def parse_product(self, response):
        title = response.xpath('//title/text()').get()
        print(title)

In other words, we are asking the spider to follow every category link and to let us know when it crawls a link containing id_, which for us means we have found a product. In this case, for the sake of the example, I'm printing the page title to the console. This should give you a good starting point.

