URL with national characters giving UnicodeEncodeError


Problem description

I'm trying to extract a dictionary entry:

import urllib.request
import lxml.html
# from urllib.parse import urlparse, parse_qs, urlencode

url = 'http://www.lingvo.ua/uk/Interpret/uk-ru/вікно'
# parsed_url = urlparse(url)
# parameters = parse_qs(parsed_url.query)
# url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl()
page = urllib.request.urlopen(url)  # the UnicodeEncodeError is raised here
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
text = xmldata.xpath('//div[@class="js-article-html g-card"]')

Whether the commented lines are enabled or not, I keep getting this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-28: ordinal not in range(128)

Solution

The problem is that your URL path contains non-ASCII characters, which must be percent-encoded with urllib.parse.quote(string) in Python 3 or urllib.quote(string) in Python 2. The commented-out lines do not help because they only re-encode the query string, and this URL has none.

# Python 3
import urllib.parse
url = 'http://www.lingvo.ua' + urllib.parse.quote('/uk/Interpret/uk-ru/вікно')

# Python 2
import urllib
url = 'http://www.lingvo.ua' + urllib.quote(u'/uk/Interpret/uk-ru/вікно'.encode('UTF-8'))

NOTE: According to "What is the proper way to URL encode Unicode characters?", URLs should be encoded as UTF-8. However, that does not preclude percent-encoding the resulting non-ASCII UTF-8 bytes.
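
If you only have the complete URL rather than a separate host and path, one possible approach is to split it, quote just the path, and reassemble it. This is a minimal Python 3 sketch, not part of the original answer. The helper name quote_url_path is hypothetical, and it assumes the host name is already ASCII (an internationalized domain would need separate IDNA handling).

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode the path of a URL and leave the other parts as they are.

    Hypothetical helper for illustration; assumes an ASCII host name.
    """
    parts = urlsplit(url)
    # quote() encodes the text as UTF-8 by default and escapes each byte.
    # safe="/%" keeps path separators and leaves existing %XX escapes alone,
    # so calling the helper twice does not double-encode. The trade-off is
    # that a literal "%" in the path is not escaped either.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=encoded_path))


original = 'http://www.lingvo.ua/uk/Interpret/uk-ru/вікно'
url = quote_url_path(original)

print(url)                       # ASCII-only URL, ready for urlopen()
print(unquote(url) == original)  # decoding the escapes gives the input back
```

The resulting url can then be passed to urllib.request.urlopen(url) in place of the raw string.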
