Scrapy - Spider crawls duplicate URLs

Problem Description

I'm crawling a search results page and scraping the title and link information from that page. Since it is a search page, I also have links to the next pages, which I have allowed in the SgmlLinkExtractor.

The problem is this: on the 1st page the spider finds the links to Page 2 and Page 3 and crawls them perfectly. But when it crawls the 2nd page, that page again has links to Page 1 (the previous page) and Page 3 (the next page), so it crawls Page 1 again with Page 2 as the referrer, and the crawl goes into a loop.

The Scrapy version I use is 0.17.

I have searched the web for answers and tried the following.

1)

Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*")), callback="parse_items_1", unique=True, follow=True),

But unique was not recognized as a valid parameter there.
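
(As far as I can tell, unique is a parameter of SgmlLinkExtractor itself rather than of Rule, so a rule like the sketch below is at least accepted; but it seems to only de-duplicate the links extracted from a single page, not across the whole crawl.)

Rule(SgmlLinkExtractor(allow=("ref=sr_pg_*",), unique=True),
     callback="parse_items_1", follow=True),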

2) I have tried to specify the default filter in settings as DUPEFILTER_CLASS = RFPDupeFilter

    DUPEFILTER_CLASS = RFPDupeFilter
NameError: name 'RFPDupeFilter' is not defined
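
(My guess is the NameError comes from using the class name directly rather than its import path as a string; for Scrapy 0.17 I assume the setting would look like the line below, although this class seems to be the default duplicates filter anyway.)

    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'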

3) I have also tried using a custom filter, a snippet I found on the web, but I don't understand much of it. The code is as follows. The visit id and status are captured, but it does not recognize pages that have already been crawled.

Note: the snippet is copied from the web and I don't have many details on it.

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from Amaze.items import AmazeItem

class IgnoreVisitedItems(object):
    """Spider middleware that remembers which requests have already
    produced items and, for requests flagged with 'filter_visited' in
    their meta, emits an item marked as old instead of re-crawling."""

    FILTER_VISITED = 'filter_visited'  # meta flag that opts a request into filtering
    VISITED_ID = 'visited_id'          # optional explicit id in the request meta
    CONTEXT_KEY = 'visited_ids'        # key for the ids already seen, stored on spider.context

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # emit a placeholder item marking this page as already visited
                ret.append(AmazeItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
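
(From what I can work out, this is meant to be enabled as a spider middleware and only acts on requests that carry the filter_visited flag in their meta. I assume it would be wired up roughly like this, with the class saved under my Amaze project; the module path below is my guess.)

# settings.py
SPIDER_MIDDLEWARES = {
    'Amaze.middlewares.IgnoreVisitedItems': 543,
}

# in a spider callback -- only requests with this meta flag get filtered
yield Request(url, callback=self.parse_items_1,
              meta={'filter_visited': True})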

My intention is to have the spider itself ignore already crawled web pages, rather than keeping the crawled pages in a list and checking every page against that list.

Please share any ideas on this.

Recommended Answer

You did not give a code example of your Spider, but possibly you are passing the argument dont_filter=True when calling the Request method. Try to specify Request(dont_filter=False) explicitly. This tells the Spider that it does not need to repeat identical requests.
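
For illustration, a minimal sketch (the URL and callback name below are made up, since the original spider code was not shown). A request left with the default dont_filter=False goes through the duplicates filter, so an already-seen URL is silently dropped; dont_filter=True bypasses the filter and lets the same page be crawled again:

from scrapy.http import Request

# inside the spider class -- hypothetical names
def parse(self, response):
    # filtered: a URL that was already requested is dropped
    yield Request("http://www.example.com/search?page=2", callback=self.parse)

    # unfiltered: this form would re-crawl the same page over and over
    # yield Request("http://www.example.com/search?page=2",
    #               callback=self.parse, dont_filter=True)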
