scrapy: dealing with special characters in url


Question


I'm scraping an XML sitemap that contains special characters like é, which results in

ERROR: Spider error processing <GET [URL with '%C3%A9' instead of 'é']>

How do I get Scrapy to keep the original URL as is, i.e. with the special character in it?

Scrapy==1.3.3

Python==3.5.2 (I need to stick to these versions)

Update: As per https://stackoverflow.com/a/17082272/6170115, I was able to get the URL with the correct character using unquote:

Example usage:

>>> from urllib.parse import unquote
>>> unquote('ros%C3%A9')
'rosé'
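
Applied inside a spider callback, a minimal sketch might look like this (the spider name and sitemap URL are hypothetical placeholders, not taken from the question):

from urllib.parse import unquote

from scrapy.spiders import SitemapSpider


class ExampleSpider(SitemapSpider):
    # Hypothetical spider for illustration only.
    name = 'example'
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        # response.url arrives percent-encoded (e.g. 'ros%C3%A9');
        # unquote() recovers the original characters (e.g. 'rosé').
        self.logger.info('Original URL: %s', unquote(response.url))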

I also tried my own Request subclass without safe_url_string (a sketch of that attempt follows the traceback below), but I ended up with:

UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)

Full traceback:

[scrapy.core.scraper] ERROR: Error downloading <GET [URL with characters like ù]>
Traceback (most recent call last):
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 61, in download_request
    return agent.download_request(request)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 260, in download_request
    agent = self._get_agent(request, timeout)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 241, in _get_agent
    scheme = _parse(request.url)[0]
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 37, in _parse
    return _parsed_url_args(parsed)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 19, in _parsed_url_args
    path = b(path)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 17, in <lambda>
    b = lambda s: to_bytes(s, encoding='ascii')
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/python.py", line 120, in to_bytes
    return text.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)
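
For reference, a minimal sketch of what such a subclass might look like, assuming it overrides Scrapy's private _set_url hook (the class name is hypothetical). As the traceback shows, keeping the raw URL then fails downstream when the downloader encodes the path as ASCII:

import scrapy


class RawRequest(scrapy.Request):
    # Hypothetical subclass for illustration: skips the safe_url_string()
    # escaping that scrapy.Request._set_url normally applies, so the URL
    # is stored exactly as given. This is the variant that triggers the
    # UnicodeEncodeError above during download.
    def _set_url(self, url):
        if not isinstance(url, str):
            raise TypeError('Request url must be str, got %s' % type(url).__name__)
        if ':' not in url:
            raise ValueError('Missing scheme in request url: %s' % url)
        self._url = url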

Any tips?

Solution

I don't think you can do that, as Scrapy uses safe_url_string from the w3lib library before storing a Request's URL. You would somehow have to reverse that.
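
For illustration, the round trip described here, safe_url_string escaping on the way in and unquote reversing it afterwards, looks roughly like this (the example URL is a placeholder):

>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://www.example.com/rosé')
'http://www.example.com/ros%C3%A9'
>>> from urllib.parse import unquote
>>> unquote('http://www.example.com/ros%C3%A9')
'http://www.example.com/rosé'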
