How to fix scrapy rules when only one rule is followed


Problem description


This code is not working:

name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]

rules = (
    #categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'),unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div/a'),unique=True),callback='parse_item'),
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'),unique=True)),
)

The first rule is getting responses, but the second rule is not working. I'm sure that the second rule's XPath is correct (I've tried it in scrapy shell). I also tried adding a callback to the first rule, selecting the path from the second rule ('//div[@id="ItemResultList"]/div/div/div/a'), and issuing a Request, and that works correctly.

I also tried a workaround: using a BaseSpider instead of a CrawlSpider, but it only issues the first request and never calls the callback. How should I fix this?

Solution

The order of rules is important. According to the Scrapy docs on CrawlSpider rules:

If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
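That first-match-wins behavior is the heart of the problem, and it can be sketched in plain Python without Scrapy. The rule names and URL predicates below are illustrative assumptions, not the Scrapy API; the point is only that each link is claimed by the first rule that matches it, and later rules never see it:

```python
# Minimal sketch (no Scrapy required) of CrawlSpider's first-match-wins
# dispatch. Rule names and predicates are hypothetical stand-ins.
rules = [
    ("follow_categories", lambda url: "/l/" in url),  # broad "categories" rule, defined first
    ("parse_item", lambda url: "item" in url),        # item rule, defined second
]

def dispatch(url):
    for name, matches in rules:
        if matches(url):
            return name  # first match wins; remaining rules are skipped
    return None

# An item URL that also satisfies the broad category predicate is claimed
# by the first rule, so the item rule never fires for it:
print(dispatch("http://uae.souq.com/ae-en/antique/l/some-item"))  # follow_categories
```

This mirrors what happens in the question: the item links also satisfy the broad first rule, so `parse_item` is never reached.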

If I follow the first link in http://uae.souq.com/ae-en/shop-all-categories/c/, i.e. http://uae.souq.com/ae-en/antique/l/, the items you want to follow are within this structure:

<div id="body-column-main">
    <div id="box-ads-souq-1340" class="box-container ">...
    <div id="box-results" class="box-container box-container-none ">
        <div class="box box-style-none box-padding-none">
            <div class="bord_b_dash overhidden hidden-phone">
            <div class="item-all-controls-wrapper">
            <div id="ItemResultList">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                ...

So the links you target with the 2nd Rule sit inside <div> elements that have "fl" in their class, which means they also match the first rule (it extracts every link under '//div[@id="body-column-main"]//div[contains(@class,"fl")]'). Since the first matching rule wins, those links will NOT be parsed with parse_item.

Simple solution: try putting your 2nd Rule before the "categories" Rule (unique=True is already the default for SgmlLinkExtractor, so you can drop it).

name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div')), callback='parse_item'),

    #categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'))),

    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'))),
)

Another option is to change your first rule for category pages to a more restrictive XPath, that does not exist in the individual category pages, such as '//div[@id="body-column-main"]//div[contains(@class,"fl")]//ul[@class="refinementBrowser-mainList"]'

You could also define a regex for the category pages and use the allow parameter in your Rules.
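For illustration, here is one possible pattern, assumed from the URLs shown above (category listing pages such as http://uae.souq.com/ae-en/antique/l/ end in /l/, while the all-categories page ends in /c/); verify it against your own crawl before relying on it:

```python
import re

# Hypothetical category-page pattern, assumed from the URLs in the
# question: category listing pages look like /ae-en/<category>/l/.
category_re = r'/ae-en/[\w-]+/l/'

print(bool(re.search(category_re, "http://uae.souq.com/ae-en/antique/l/")))             # True
print(bool(re.search(category_re, "http://uae.souq.com/ae-en/shop-all-categories/c/"))) # False
```

The same string could then be passed to the category Rule's extractor as SgmlLinkExtractor(allow=(category_re,)), so that rule no longer competes for item links.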
