Scrapy SgmlLinkExtractor is ignoring allowed links

Problem Description

Please take a look at this spider example in the Scrapy documentation. The explanation is:

This spider would start crawling example.com's home page, collecting category links and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

I copied the same spider exactly, and replaced "example.com" with another initial url.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from stb.items import StbItem

class StbSpider(CrawlSpider):
    domain_name = "stb"
    start_urls = ['http://www.stblaw.com/bios/MAlpuche.htm']

    rules = (Rule(SgmlLinkExtractor(allow=(r'/bios/.\w+\.htm', )), callback='parse', follow=True), )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = StbItem()
        item['JD'] = hxs.select('//td[@class="bodycopysmall"]').re('\d\d\d\d\sJ.D.')
        return item

SPIDER = StbSpider()

But my spider "stb" does not collect links from "/bios/" as it is supposed to. It crawls the initial URL, scrapes item['JD'], writes it to a file, and then quits.

Why is the SgmlLinkExtractor being ignored? I know the Rule is being read, because syntax errors inside the Rule line do get caught.

Is this a bug? Is there something wrong with my code? There are no errors except a bunch of unhandled errors that I see on every run.

It would be nice to know what I am doing wrong here. Thanks for any clues. Am I misunderstanding what SgmlLinkExtractor is supposed to do?

Recommended Answer

The parse function is actually implemented and used in the CrawlSpider class, and you're unintentionally overriding it. If you change the name to something else, like parse_item, then the Rule should work.
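
As a minimal sketch of that fix, keeping the question's original (now-deprecated) scrapy.contrib imports and only renaming the callback — escaping the dots in the regex so "J.D." is matched literally is an assumption added here, since the original pattern used unescaped dots:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from stb.items import StbItem

class StbSpider(CrawlSpider):
    domain_name = "stb"
    start_urls = ['http://www.stblaw.com/bios/MAlpuche.htm']

    # The callback is renamed so it no longer shadows CrawlSpider.parse,
    # which CrawlSpider uses internally to dispatch responses to the rules.
    rules = (Rule(SgmlLinkExtractor(allow=(r'/bios/.\w+\.htm', )),
                  callback='parse_item', follow=True), )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = StbItem()
        # Assumed change: dots escaped to match the literal text "J.D."
        item['JD'] = hxs.select('//td[@class="bodycopysmall"]').re(r'\d{4}\sJ\.D\.')
        return item

SPIDER = StbSpider()

The same rule applies in modern Scrapy: CrawlSpider reserves parse for its own rule-dispatching logic, so rule callbacks must always use a different name. (The scrapy.contrib modules, SgmlLinkExtractor, and HtmlXPathSelector have since been replaced by scrapy.spiders, LinkExtractor, and response.xpath.)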
