Constructing a regular expression for the URLs in the start_urls list in the Scrapy framework (Python)
Question
I am very new to Scrapy, and I have not used regular expressions before.
The following is my spider.py code:
    # Import paths as used by the old Scrapy releases that provide BaseSpider.
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class ExampleSpider(BaseSpider):
        name = "test_code"
        allowed_domains = ["www.example.com"]
        start_urls = [
            "http://www.example.com/bookstore/new/1?filter=bookstore",
            "http://www.example.com/bookstore/new/2?filter=bookstore",
            "http://www.example.com/bookstore/new/3?filter=bookstore",
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
Now, if we look at start_urls, all three URLs are the same except for the integer value (1, 2, 3, and so on). The number of such URLs is not fixed; it depends on how many pages exist on the site. I know that we can use CrawlSpider and construct a regular expression for the URL, like below:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    import re

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            "http://www.example.com/bookstore/new/1?filter=bookstore",
            "http://www.example.com/bookstore/new/2?filter=bookstore",
            "http://www.example.com/bookstore/new/3?filter=bookstore",
        ]

        rules = (
            # The regular expression for `allow` is the part being asked about.
            Rule(SgmlLinkExtractor(allow=(........),)),
        )

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
Could you please guide me on how to construct a CrawlSpider Rule for the above start_urls list?
Recommended answer
If I understand you correctly, you want a lot of start URLs that follow a certain pattern.
If so, you can override the BaseSpider.start_requests method:
    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        name = "test_code"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            # xrange(1000) yields page numbers 0..999; adjust the bounds to the site.
            for i in xrange(1000):
                yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)

        ...
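The question also asks what a regular expression for these URLs could look like. Below is a minimal sketch using only the standard re module, so it can be tried without Scrapy. The names URL_TEMPLATE, PAGE_URL_PATTERN, build_page_urls and is_listing_url are hypothetical and chosen for illustration. The pattern is an assumption based on the three URLs in the question; it might serve as the `allow` value of a link extractor if the listing pages link to one another, which the question does not confirm.

```python
import re

# Assumed URL template, copied from the start_urls in the question.
URL_TEMPLATE = "http://www.example.com/bookstore/new/%d?filter=bookstore"

# Hypothetical pattern for the paginated listing URLs:
# one or more digits after /bookstore/new/, then the literal query string.
PAGE_URL_PATTERN = r"/bookstore/new/\d+\?filter=bookstore$"


def build_page_urls(first_page, last_page):
    """Return the listing URLs for first_page..last_page, inclusive."""
    return [URL_TEMPLATE % page for page in range(first_page, last_page + 1)]


def is_listing_url(url):
    """Return True if url looks like one of the paginated listing pages."""
    return re.search(PAGE_URL_PATTERN, url) is not None
```

Note that `?` is escaped in the pattern because an unescaped `?` is a regex quantifier rather than a literal character. The list returned by build_page_urls could be assigned to start_urls in place of the hand-written entries, provided the page range is known in advance.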