scrapy: dealing with special characters in url
I'm scraping an XML sitemap which contains special characters like é, which results in
ERROR: Spider error processing <GET [URL with '%C3%A9' instead of 'é']>
How do I get Scrapy to keep the original URL as is, i.e. with the special character in it?
Scrapy==1.3.3
Python==3.5.2 (I need to stick to these versions)
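For context (a minimal stdlib sketch, not part of the original question): the '%C3%A9' in the error message is simply the UTF-8 percent-encoding of 'é', which urllib.parse.quote reproduces:

```python
from urllib.parse import quote

# 'é' is the two bytes 0xC3 0xA9 in UTF-8, so percent-encoding
# turns it into '%C3%A9' — exactly what appears in the error.
print(quote('é'))  # → %C3%A9
```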
Update:
As per https://stackoverflow.com/a/17082272/6170115 I was able to get the URL with the correct character using unquote:
Example usage:
>>> from urllib.parse import unquote
>>> unquote('ros%C3%A9')
'rosé'
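Building on that, one way to keep a human-readable copy of the URL (e.g. for items or logs) is a small helper around unquote; readable_url is a hypothetical name for illustration, and Scrapy itself will still use the percent-encoded form when downloading:

```python
from urllib.parse import unquote

def readable_url(url):
    # Hypothetical helper: decode percent-escapes back to the original
    # characters, for storing/logging only — not for feeding back into
    # the downloader, which expects an ASCII-safe URL.
    return unquote(url)

print(readable_url('http://example.com/ros%C3%A9'))  # → http://example.com/rosé
```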
I also tried my own Request subclass without safe_url_string, but I end up with:
UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)
Full traceback:
[scrapy.core.scraper] ERROR: Error downloading <GET [URL with characters like ù]>
Traceback (most recent call last):
File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
return handler.download_request(request, spider)
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 61, in download_request
return agent.download_request(request)
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 260, in download_request
agent = self._get_agent(request, timeout)
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 241, in _get_agent
scheme = _parse(request.url)[0]
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 37, in _parse
return _parsed_url_args(parsed)
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 19, in _parsed_url_args
path = b(path)
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 17, in <lambda>
b = lambda s: to_bytes(s, encoding='ascii')
File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/python.py", line 120, in to_bytes
return text.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)
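The last frames show the root cause: Scrapy 1.3's downloader calls to_bytes(s, encoding='ascii') on the URL path, and a raw 'ù' (U+00F9, the '\xf9' in the message) cannot be ASCII-encoded. A minimal stdlib reproduction of that failure:

```python
# Mirrors the downloader's to_bytes(..., encoding='ascii') step:
# ASCII cannot represent 'ù' (U+00F9), so encoding raises.
try:
    'ù'.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\xf9' ...
```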
Any tips?
I don't think you can do that, as Scrapy uses safe_url_string from the w3lib library before storing the Request's URL. You would somehow have to reverse that.
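What safe_url_string does to non-ASCII characters is essentially UTF-8 percent-encoding, so the transformation is reversible with unquote. A stdlib approximation of the round trip (using quote as a stand-in for w3lib's safe_url_string; the example URL is made up):

```python
from urllib.parse import quote, unquote

original = 'http://example.com/rosé'

# Rough stand-in for what safe_url_string does to the non-ASCII part;
# safe=':/' keeps the scheme and path separators intact.
encoded = quote(original, safe=':/')
print(encoded)           # → http://example.com/ros%C3%A9
print(unquote(encoded))  # → http://example.com/rosé
```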