Scrapy:'str' 对象没有属性 'iter' [英] Scrapy: 'str' object has no attribute 'iter'
问题描述
我向我的爬虫蜘蛛添加了 restrict_xpaths
规则,现在它立即失败了:
I added restrict_xpaths
rules to my scrapy spider and now it immediately fails with:
2015-03-16 15:46:53+0000 [tsr] ERROR: Spider error processing <GET http://www.thestudentroom.co.uk/forumdisplay.php?f=143>
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 602, in _tick
taskObj._oneWorkUnit()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 479, in _oneWorkUnit
result = self._iterator.next()
File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
yield next(it)
File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
for x in result:
File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
File "/Library/Python/2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 107, in extract_links
links = self._extract_links(doc, response.url, response.encoding, base_url)
File "/Library/Python/2.7/site-packages/scrapy/linkextractor.py", line 94, in _extract_links
return self.link_extractor._extract_links(*args, **kwargs)
File "/Library/Python/2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 50, in _extract_links
for el, attr, attr_val in self._iter_links(selector._root):
**File "/Library/Python/2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 38, in _iter_links
for el in document.iter(etree.Element):
exceptions.AttributeError: 'str' object has no attribute 'iter'**
我不明白为什么会发生这个错误.
I cannot understand why this error is happening.
这是我的简短Spider
:
import scrapy
from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class TsrSpider(CrawlSpider):
name = 'tsr'
allowed_domains = ['thestudentroom.co.uk']
start_urls = ['http://www.thestudentroom.co.uk/forumdisplay.php?f=143']
download_delay = 4
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:35.0) Gecko/20100101 Firefox/35.0'
rules = (
Rule(
LinkExtractor(
allow=('forumdisplay\.php\?f=143\&page=\d',),
restrict_xpaths=("//li[@class='pager-page_numbers']/a/@href",))),
Rule(
LinkExtractor(
allow=('showthread\.php\?t=\d+\&page=\d+',),
restrict_xpaths=("//li[@class='pager-page_numbers']/a/@href",)),
callback='parse_link'),
Rule(
LinkExtractor(
allow=('showthread\.php\?t=\d+',),
restrict_xpaths=("//tr[@class='thread unread ']",)),
callback='parse_link'),
)
def parse_link(self, response):
# Iterate over posts.
for sel in response.xpath("//li[@class='post threadpost old ']"):
rating = sel.xpath(
"div[@class='post-footer']//span[@class='score']/text()").extract()
if not rating:
rating = 0
else:
rating = rating[0]
item = DmozItem()
item['post'] = sel.xpath(
"div[@class='post-content']/blockquote[@class='postcontent restore']/text()").extract()
item['link'] = response.url
item['topic'] = response.xpath(
"//div[@class='forum-header section-header']/h1/span/text()").extract()
item['rating'] = rating
yield item
来源:http://pastebin.com/YXdWvPgX
有人可以帮我吗?错误在哪里?我找了好几天了!?
Can someone help me out? Where is the mistake? I've been searching for days!?
推荐答案
问题在于 restrict_xpaths
应该指向元素 - 直接链接或包含链接的容器,不是属性:
The problem is that restrict_xpaths
should point to elements - either the links directly or containers containing links, not attributes:
rules = [
Rule(LinkExtractor(allow='forumdisplay\.php\?f=143\&page=\d',
restrict_xpaths="//li[@class='pager-page_numbers']/a")),
Rule(LinkExtractor(allow='showthread\.php\?t=\d+\&page=\d+',
restrict_xpaths="//li[@class='pager-page_numbers']/a"),
callback='parse_link'),
Rule(LinkExtractor(allow='showthread\.php\?t=\d+',
restrict_xpaths="//tr[@class='thread unread ']"),
callback='parse_link'),
]
经过测试(对我有用).
Tested (worked for me).
仅供参考,Scrapy 定义了 restrict_xpaths
作为指向区域的表达式":
FYI, Scrapy defines restrict_xpaths
as "expressions pointing to regions":
restrict_xpaths
(str or list) – 是一个 XPath(或 XPath 的列表)在应提取链接的响应中定义区域从.如果给定,则只会扫描那些 XPath 选择的文本对于链接.请参阅下面的示例.
restrict_xpaths
(str or list) – is a XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links. See examples below.
这篇关于Scrapy:'str' 对象没有属性 'iter'的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!