crawlSpider seems not to follow rule
Question
I followed the example in "Recursively Scraping Web Pages With Scrapy", and it seems I have included a mistake somewhere. Can someone help me find it, please? It's driving me crazy: I only want all the results from all the result pages, but instead it gives me only the results from page 1.

Here is my code:
import scrapy
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from githubScrape.items import GithubscrapeItem


class GithubSpider(CrawlSpider):
    name = "github2"
    allowed_domains = ["github.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[contains(@class, "next_page")]')),
             callback='parse_items', follow=True),
    )

    def start_requests(self):
        baseURL = 'https://github.com/search?utf8=%E2%9C%93&q=eagle+SYSTEM+extension%3Asch+size%3A'
        for i in range(10000, 20000, 5000):
            url = baseURL + str(i+1) + ".." + str(i+5000) + '&type=Code&ref=searchresults'
            print "URL:", url
            yield Request(url, callback=self.parse_items)

    def parse_items(self, response):
        hxs = Selector(response)
        resultParagraphs = hxs.xpath('//div[contains(@id,"code_search_results")]//p[contains(@class, "title")]')
        items = []
        for p in resultParagraphs:
            hrefs = p.xpath('a/@href').extract()
            projectURL = hrefs[0]
            schemeURL = hrefs[1]
            lastIndexedOn = p.xpath('.//span/time/@datetime').extract()
            i = GithubscrapeItem()
            i['counter'] = self.count
            i['projectURL'] = projectURL
            i['schemeURL'] = schemeURL
            i['lastIndexedOn'] = lastIndexedOn
            items.append(i)
        return items
Answer
I didn't find your code on the link you passed, but I think the problem is that you are never using the rules.
Scrapy starts crawling by calling the start_requests method, but the rules are compiled and applied in the parse method, which you are not using, because your requests go directly from start_requests to parse_items.
You could remove the callback in the start_requests method if you want the rules to be applied at that level.