Scrapy spider: difference between Crawled pages and Scraped items
Problem description
I'm writing a Scrapy CrawlSpider that reads a list of ads on the first page, takes some info such as the listing thumbnails and ad URLs, then yields a request to each of these ad URLs to fetch their details.
It was working and paginating apparently well in the test environment, but today, trying to make a complete run, I noticed this in the log:
Crawled 3852 pages (at 228 pages/min), scraped 256 items (at 15 items/min)
I don't understand the reason for this big difference between crawled pages and scraped items. Can anybody help me figure out where those items are getting lost?
My spider code:
    # Legacy Scrapy (0.x) import paths, matching the APIs used below
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    # myItem and myItemLoader are the asker's own classes (definitions not shown)


    class MySpider(CrawlSpider):
        name = "myspider"
        allowed_domains = ["myspider.com", "myspider.co"]
        start_urls = [
            "http://www.myspider.com/offers/myCity/typeOfAd/?search=fast",
        ]

        # Pagination
        rules = (
            Rule(SgmlLinkExtractor(), callback='parse_start_url', follow=True),
        )

        # 1st page
        def parse_start_url(self, response):
            hxs = HtmlXPathSelector(response)
            next_page = hxs.select("//a[@class='pagNext']/@href").extract()
            offers = hxs.select("//div[@class='hlist']")
            for offer in offers:
                # Use a separate local name so the myItem class is not shadowed
                item = myItem()
                item['url'] = offer.select('.//span[@class="location"]/a/@href').extract()[0]
                item['thumb'] = offer.select('.//div[@class="itemFoto"]/div/a/img/@src').extract()[0]
                # Carry the partially filled item over to the detail page callback
                request = Request(item['url'], callback=self.second_page)
                request.meta['myItem'] = item
                yield request
            if next_page:
                yield Request(next_page[0], callback=self.parse_start_url)

        # Detail page: an item is only produced here
        def second_page(self, response):
            myItem = response.meta['myItem']
            loader = myItemLoader(item=myItem, response=response)
            loader.add_xpath('address', '//span[@itemprop="streetAddress"]/text()')
            return loader.load_item()
Recommended answer
Let's say you go to your first start_urls entry (actually you only have one) and on this page there is only one anchor link (<a>). So your spider crawls the href URL in this link and you get control in your callback, parse_start_url. Inside this page you have 5000 divs with an hlist class. And let's suppose all 5000 of these subsequent URLs return 404, not found.
In this case you would have:
- Pages crawled: 5001
- Items scraped: 0
Let's take another example: on your start URL page you have 5000 anchors, but none (as in zero) of those pages have any divs with a class attribute of hlist.
In this case you would have:
- Pages crawled: 5001
- Items scraped: 0
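The two scenarios above can be mimicked with a toy model. This is only a sketch and not Scrapy code: simulate_crawl and its arguments are hypothetical names, and it assumes that every response received counts as a crawled page while an item is produced only when a detail page comes back with status 200.

```python
def simulate_crawl(listing_pages, detail_status):
    """Toy counter for crawled pages vs. scraped items (not Scrapy itself).

    listing_pages: one list of detail URLs per listing page downloaded
    detail_status: maps each detail URL to the HTTP status it returns
    """
    pages_crawled = 0
    items_scraped = 0
    for detail_urls in listing_pages:
        pages_crawled += 1  # the listing page itself was downloaded
        for url in detail_urls:
            pages_crawled += 1  # a response came back, whatever its status
            if detail_status.get(url) == 200:
                items_scraped += 1  # only a good detail page yields an item
    return pages_crawled, items_scraped


# Scenario 1: one listing page with 5000 offers, every detail URL returns 404
urls = ["http://www.myspider.com/ad/%d" % i for i in range(5000)]
scenario_1 = simulate_crawl([urls], {url: 404 for url in urls})

# Scenario 2: 5001 pages downloaded, none of them contains an offer
scenario_2 = simulate_crawl([[] for _ in range(5001)], {})

print(scenario_1, scenario_2)
```

In both cases many pages are downloaded but nothing reaches the item stage, which is the same shape as the gap in the question's log line.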
Your answer lies in the DEBUG log output.
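One way to read that output is to tally the HTTP status codes on the "Crawled (status)" lines. The sketch below assumes the log contains lines of that shape. The sample lines are hand-written for illustration (they are not from the asker's run), and tally_statuses is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the status code in lines such as: DEBUG: Crawled (404) <GET ...>
CRAWLED_RE = re.compile(r"Crawled \((\d{3})\)")


def tally_statuses(log_lines):
    """Count how many crawled responses were logged per HTTP status code."""
    counts = Counter()
    for line in log_lines:
        match = CRAWLED_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative, hand-written sample lines with made-up URLs
sample_log = [
    "DEBUG: Crawled (200) <GET http://www.myspider.com/offers/page1> (referer: None)",
    "DEBUG: Crawled (404) <GET http://www.myspider.com/ad/1> (referer: http://www.myspider.com/offers/page1)",
    "DEBUG: Crawled (404) <GET http://www.myspider.com/ad/2> (referer: http://www.myspider.com/offers/page1)",
    "INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)",
]

status_counts = tally_statuses(sample_log)
print(status_counts)
```

A large share of non-200 statuses, or of 200 responses on pages that never lead to a "Scraped" line, would point to where the items are dropping out.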