Constructing a regular expression for url in start_urls list in scrapy framework python


Problem description

I am very new to Scrapy, and I haven't used regular expressions before.

The following is my spider.py code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

Now if we look at start_urls, all three URLs are the same except that they differ in the integer value (2?, 3? and so on, unlimited depending on the URLs present on the site). I know that we can use CrawlSpider and construct a regular expression for the URL like below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
import re

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(........))),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

Can you please guide me on how to construct a CrawlSpider Rule for the above start_urls list?

Answer

If I understand you correctly, you want a lot of start URLs with a certain pattern.

If so, you can override the BaseSpider.start_requests method:

from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]

    def start_requests(self):
        # Generate one request per numbered bookstore page, no regex needed
        for i in xrange(1000):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)

    ...
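If you do want to stay with the CrawlSpider approach from the question, the allow parameter of SgmlLinkExtractor takes regular expressions that are matched against extracted URLs. Below is a minimal sketch of what such a Rule could look like, assuming the URL pattern shown above; the regex and the parse_item callback name are illustrative, not a tested implementation. Note that a CrawlSpider must not override parse, since CrawlSpider uses it internally:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ["http://www.example.com/bookstore/new/1?filter=bookstore"]

    rules = (
        # Follow any link matching /bookstore/new/<number>?filter=bookstore
        Rule(SgmlLinkExtractor(allow=(r'/bookstore/new/\d+\?filter=bookstore$',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):  # named parse_item because parse is reserved
        hxs = HtmlXPathSelector(response)

This only follows links that are actually present on the pages, whereas the start_requests approach above generates the URLs directly, which is why it fits the "unlimited numbered pages" case better.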

