Scrapy SgmlLinkExtractor question
Question
I am trying to get the SgmlLinkExtractor to work.
This is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)
I am just using allow=()
So, I enter
rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)
So, the initial URL is 'http://www.whitecase.com/jacevedo/', and I am entering allow=('/aadler',) and expect that '/aadler/' will get scanned as well. But instead, the spider scans the initial URL and then closes:
[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)
What am I doing wrong here?
Is there anyone here who has used Scrapy successfully and can help me finish this spider?
Thank you for the help.
I include the code for the spider below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()
Note: SO will not let me post more than one URL, so substitute the initial URL as necessary. Sorry about that.
Recommended answer
It appears you are overriding the "parse" method. CrawlSpider uses "parse" internally to follow links, so overriding it prevents the rules from being applied. Give your callback a different name.
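To illustrate why the override matters, here is a minimal sketch that does not use Scrapy at all. ToyCrawlSpider is a hypothetical, heavily simplified stand-in for CrawlSpider, and the page dictionary, the substring matching, and the name parse_item are assumptions made only for this illustration. It shows that once a subclass replaces parse, the link-following logic in the base class never runs.

```python
class ToyCrawlSpider:
    """Hypothetical, simplified stand-in for CrawlSpider (not Scrapy code)."""

    rules = ()  # pairs of (substring a link must contain, callback name)

    def parse(self, page):
        # The base class relies on parse() to find links that match
        # the rules and to schedule them with the rule's callback.
        followed = []
        for link in page["links"]:
            for pattern, callback in self.rules:
                if pattern in link:
                    followed.append((link, callback))
        return followed


class BrokenSpider(ToyCrawlSpider):
    rules = (("/aadler/", "parse"),)

    def parse(self, page):
        # This replaces the link-following logic above, so only an
        # item for the start page is produced and no link is followed.
        return {"school": page["text"]}


class FixedSpider(ToyCrawlSpider):
    rules = (("/aadler/", "parse_item"),)

    def parse_item(self, page):
        # A differently named callback leaves the base parse() intact.
        return {"school": page["text"]}


start_page = {
    "text": "JD, Columbia Law School, 2005",
    "links": ["/aadler/", "/contact/"],
}

broken_result = BrokenSpider().parse(start_page)
fixed_result = FixedSpider().parse(start_page)

print(broken_result)  # only the start page's item, no links followed
print(fixed_result)   # the matching link, paired with its callback name
```

In the spider from the question, the corresponding change would be to rename the method to something other than parse (for example parse_item) and pass that same name as the callback in the Rule.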