Python 3.4.0 -- 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128) -- Unix 14.04
Problem description
While trying to retrieve some data from the web using urllib and lxml, I got an error and have no idea how to fix it.

    import urllib.request

    url = 'http://sum.in.ua/?swrd=автор'
    page = urllib.request.urlopen(url)
The error itself:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-15: ordinal not in range(128)
This time the URL contains Ukrainian letters. When I request a URL without any Ukrainian letters in it, like this:

    import urllib.request
    import lxml.html

    url = "http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id=8&Itemid=9"
    page = urllib.request.urlopen(url)
    pageWritten = page.read()
    pageReady = pageWritten.decode('utf-8')
    xmldata = lxml.html.document_fromstring(pageReady)
    text1 = xmldata.xpath('//p[@class="MsoNormal"]//text()')

I get the data back in Ukrainian and everything works fine.
Solution
URLs can only use a subset of printable ASCII codepoints; everything else must be properly encoded using URL percent encoding.
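To see why the original call fails, here is a minimal sketch that needs no network access. It assumes the failure comes from encoding the HTTP request line as ASCII; the `request_line` string is an illustrative reconstruction, not captured output. It then uses `urllib.parse.quote()` to percent-encode the offending value.

```python
from urllib.parse import quote

# Illustrative reconstruction of the request line for the failing URL (an assumption).
request_line = 'GET /?swrd=автор HTTP/1.1'

try:
    request_line.encode('ascii')
except UnicodeEncodeError as exc:
    # exc.start is the first offending index, exc.end is one past the last one.
    failure = (exc.start, exc.end - 1)
    print('cannot encode positions {}-{}'.format(*failure))

# Percent-encoding the value leaves only ASCII characters.
encoded_value = quote('автор')
print(encoded_value)
encoded_value.encode('ascii')  # no exception this time
```

The positions reported by the sketch match the 11-15 range in the error message from the question, which is consistent with the assumption about the request line.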
You can best achieve that by letting Python handle your parameters. The urllib.parse.urlencode() function can convert a dictionary (or a sequence of key-value pairs) for use in URLs:

    import urllib.request
    from urllib.parse import urlencode

    url = 'http://sum.in.ua/'
    parameters = {'swrd': 'автор'}
    url = '{}?{}'.format(url, urlencode(parameters))
    page = urllib.request.urlopen(url)
This will first encode the parameters to UTF-8 bytes, then convert those bytes to percent-encoding sequences:
    >>> from urllib.parse import urlencode
    >>> parameters = {'swrd': 'автор'}
    >>> urlencode(parameters)
    'swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
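The two steps described above can also be spelled out by hand. The following is a rough sketch for illustration only: it percent-encodes every byte, whereas the library functions leave unreserved ASCII characters untouched, so the results agree here only because the value is entirely Cyrillic.

```python
from urllib.parse import quote, urlencode

value = 'автор'

# Step 1: encode the text to UTF-8 bytes (two bytes per Cyrillic letter here).
utf8_bytes = value.encode('utf-8')

# Step 2: write each byte as a percent sign followed by two uppercase hex digits.
manual = ''.join('%{:02X}'.format(byte) for byte in utf8_bytes)

print(manual)
print(manual == quote(value))
print('swrd=' + manual == urlencode({'swrd': value}))
```

In real code, prefer `urlencode()` or `quote()` over a hand-rolled version like this one.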
If you did not construct this URL yourself, you'll need to 'repair' the encoding. You can split off the query string with urllib.parse.urlparse(), parse it into a dictionary with urllib.parse.parse_qs(), then pass it to urlencode and put the result back into the URL:

    from urllib.parse import urlparse, parse_qs, urlencode

    url = 'http://sum.in.ua/?swrd=автор'
    parsed_url = urlparse(url)
    parameters = parse_qs(parsed_url.query)
    url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
This splits the URL into its constituent parts, parses out the query string, re-encodes and re-builds the URL afterwards:
    >>> from urllib.parse import urlparse, parse_qs, urlencode
    >>> url = 'http://sum.in.ua/?swrd=автор'
    >>> parsed_url = urlparse(url)
    >>> parameters = parse_qs(parsed_url.query)
    >>> parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
    'http://sum.in.ua/?swrd=%D0%B0%D0%B2%D1%82%D0%BE%D1%80'
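If several URLs need the same treatment, the repair steps could be wrapped in a small function. This is a sketch under stated assumptions: `repair_query` is a hypothetical name, and `keep_blank_values=True` is an added choice (not part of the steps above) so that empty parameters are not dropped. Only the query string is handled; non-ASCII characters in the path or host name are left as they are.

```python
from urllib.parse import urlparse, parse_qs, urlencode


def repair_query(url):
    """Return url with its query string re-encoded (hypothetical helper)."""
    parsed = urlparse(url)
    # keep_blank_values=True keeps parameters such as "b=" that parse_qs drops by default.
    parameters = parse_qs(parsed.query, keep_blank_values=True)
    return parsed._replace(query=urlencode(parameters, doseq=True)).geturl()


fixed = repair_query('http://sum.in.ua/?swrd=автор')
print(fixed)

# Applying the helper to an already-encoded URL leaves it unchanged.
print(repair_query(fixed) == fixed)
```

Because parse_qs() decodes existing percent escapes before urlencode() re-encodes them, running the helper twice should not double-encode the query.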