Scrapy SgmlLinkExtractor question


Question

I am trying to make the SgmlLinkExtractor work.

Here is the signature:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)
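
(Editor's aside: a link extractor can also be exercised on its own, outside a Rule. A minimal sketch, assuming a pre-1.0 Scrapy where scrapy.contrib still exists and `response` is an already-fetched HtmlResponse, e.g. inside a spider callback:)

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# allow= takes regular expressions matched against each link's absolute url
extractor = SgmlLinkExtractor(allow=('/aadler/',))
for link in extractor.extract_links(response):  # returns Link objects
    print link.url, link.text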

I am just using allow=()

So, I enter:

rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

So, the initial url is 'http://www.whitecase.com/jacevedo/' and I am entering allow=('/aadler',), expecting that '/aadler/' will get scanned as well. But instead, the spider scans the initial url and then closes:

[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)

What am I doing wrong here?

Is there anyone here who has used Scrapy successfully who can help me finish this spider?

Thanks for your help.

I include the code for the spider below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()

Note: SO will not let me post more than 1 url, so substitute the initial url as necessary. Sorry about that.

Answer

It appears you are overriding the "parse" method. "parse" is a method CrawlSpider uses internally to follow links, so your callback needs a different name.
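
Concretely, renaming the callback stops it from shadowing CrawlSpider's built-in parse. A minimal sketch of the corrected spider, keeping the names from the question (follow=True is an assumption added here so the crawl keeps going past matched pages; it is not part of the original answer):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Nu.items import NuItem

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']  # substitute the real initial url

    # callback renamed so CrawlSpider's own parse() keeps following links
    rules = (
        Rule(SgmlLinkExtractor(allow=('/aadler/',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re(
            '(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()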

