Scrapy SgmlLinkExtractor is ignoring allowed links

Problem Description

Please take a look at this spider example in the Scrapy documentation. The explanation is:

This spider would start crawling example.com's home page, collecting category links and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

I copied the same spider exactly, and replaced "example.com" with another initial url.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from stb.items import StbItem

class StbSpider(CrawlSpider):
    domain_name = "stb"
    start_urls = ['http://www.stblaw.com/bios/MAlpuche.htm']

    rules = (Rule(SgmlLinkExtractor(allow=(r'/bios/.\w+\.htm', )), callback='parse', follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = StbItem()
        item['JD'] = hxs.select('//td[@class="bodycopysmall"]').re('\d\d\d\d\sJ.D.')
        return item

SPIDER = StbSpider()

But my spider "stb" does not collect links from "/bios/" as it is supposed to. It crawls the initial URL, scrapes item['JD'], writes it to a file, and then quits.

Why is the SgmlLinkExtractor being ignored? I know the Rule is being read, because syntax errors inside the Rule line do get caught.

Is this a bug? Is there something wrong with my code? There are no errors except a bunch of unhandled errors that I see on every run.

It would be nice to know what I am doing wrong here. Thanks for any clues. Am I misunderstanding what SgmlLinkExtractor is supposed to do?

Recommended Answer

The parse function is actually implemented and used in the CrawlSpider class, and you're unintentionally overriding it. If you change the name to something else, like parse_item, then the Rule should work.
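
As a minimal sketch of that fix, keeping the question's original (now-deprecated) scrapy.contrib imports and only renaming the callback — escaping the dots in the regex so "J.D." is matched literally is an assumption added here, since the original pattern used unescaped dots:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from stb.items import StbItem

class StbSpider(CrawlSpider):
    domain_name = "stb"
    start_urls = ['http://www.stblaw.com/bios/MAlpuche.htm']

    # The callback is renamed so it no longer shadows CrawlSpider.parse,
    # which CrawlSpider uses internally to dispatch responses to the rules.
    rules = (Rule(SgmlLinkExtractor(allow=(r'/bios/.\w+\.htm', )),
                  callback='parse_item', follow=True), )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = StbItem()
        # Assumed change: dots escaped to match the literal text "J.D."
        item['JD'] = hxs.select('//td[@class="bodycopysmall"]').re(r'\d{4}\sJ\.D\.')
        return item

SPIDER = StbSpider()

The same rule applies in modern Scrapy: CrawlSpider reserves parse for its own rule-dispatching logic, so rule callbacks must always use a different name. (The scrapy.contrib modules, SgmlLinkExtractor, and HtmlXPathSelector have since been replaced by scrapy.spiders, LinkExtractor, and response.xpath.)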
