Scrapy is following and scraping non-allowed links

Question

I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the links to each issue follow this URL scheme:

http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number.

I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider seems determined to follow non-allowed links: it is trying to scrape issues 377, 398, and more, and follows the "culture.htm" and "feature.htm" links. This throws a lot of errors and isn't terribly important, but it requires a lot of cleaning of the data. Any suggestions as to what is going wrong?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/issues.htm"]

    rules = (
        Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),
        Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),
        Rule(SgmlLinkExtractor(deny = ('culture.htm', )), ),
        Rule(SgmlLinkExtractor(deny = ('feature.htm', )), ),
    )

I fixed this using a much simpler regex for 2009, 2010, and 2011, but I am still curious why the above doesn't work, if anyone has any suggestions.
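
A year-based pattern along those lines, matching the year directory instead of the issue number, could look roughly like the rule below. This is purely illustrative; the exact regex used for the fix is not shown in the question.

Rule(SgmlLinkExtractor(allow = ('(2009|2010|2011)/\d+/index\.htm', )), follow = True),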

Answer

You need to pass the deny arguments to the SgmlLinkExtractor that collects the links to follow. And you don't need to create so many Rules if they all call one function, parse_item. I would write your code as:

rules = (
        Rule(SgmlLinkExtractor(
                    allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
                    deny = ('culture\.htm', 'feature\.htm'),
                    ), 
            follow = True
        ),
        Rule(SgmlLinkExtractor(
                allow = (
                    'fr[0-9].htm', 
                    'eg[0-9]*.htm',
                    'ec[0-9]*.htm',
                    'op[0-9]*.htm',
                    'sc[0-9]*.htm',
                    're[0-9]*.htm',
                    'in[0-9]*.htm',
                    )
                ), 
                callback = 'parse_item',
        ),
    )
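
A quick way to sanity-check these patterns outside Scrapy is to run them through Python's re module. The sketch below uses made-up URLs purely for illustration; note that an unanchored pattern like 're[0-9]*.htm' also matches 'culture.htm' and 'feature.htm', since both end in 're.htm', which helps explain why those pages were being scraped.

import re

# Patterns taken from the rules above; the URLs below are made-up examples.
issue_allow = r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm'
article_allow = (
    'fr[0-9].htm',
    'eg[0-9]*.htm',
    'ec[0-9]*.htm',
    'op[0-9]*.htm',
    'sc[0-9]*.htm',
    're[0-9]*.htm',
    'in[0-9]*.htm',
)

urls = [
    'http://example.com/2011/0998/index.htm',    # issue index >= 928: followed
    'http://example.com/2009/377/index.htm',     # older issue index: not matched
    'http://example.com/2009/377/fr1.htm',       # article in an old issue: still matched
    'http://example.com/2011/0998/culture.htm',  # also matched, via the unanchored 're[0-9]*.htm'
]

for url in urls:
    followed = bool(re.search(issue_allow, url))
    scraped = any(re.search(p, url) for p in article_allow)
    print('%-45s follow=%s parse_item=%s' % (url, followed, scraped))

Escaping the dots and anchoring the filenames (for example '/re[0-9]*\.htm') would also tighten the article patterns.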

If those are the real URL patterns in the rules you are using for parse_item, they can be simplified to this:

Rule(SgmlLinkExtractor(
        allow = ('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', ),
    ),
    callback = 'parse_item',
),
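
As an aside, SgmlLinkExtractor has since been deprecated and removed from Scrapy; on a current install the same idea would be expressed with scrapy.linkextractors.LinkExtractor. A rough sketch under that assumption (not part of the original answer) could look like:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/issues.htm']

    rules = (
        # Follow only issue index pages from 928 onwards, never culture/feature pages.
        Rule(LinkExtractor(
                 allow=(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm',),
                 deny=(r'culture\.htm', r'feature\.htm')),
             follow=True),
        # One rule and one callback for all article pages; deny repeated here as well,
        # because the article patterns are unanchored (see the check above).
        Rule(LinkExtractor(allow=(r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm',),
                           deny=(r'culture\.htm', r'feature\.htm')),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Item extraction logic goes here (omitted in the original question).
        pass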
