How to deal with ® in a URL for urllib2.urlopen?

Problem description

I got this URL from BeautifulSoup: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions

url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'

I want to feed it back into urllib2.urlopen.

import urllib2
source = urllib2.urlopen(url).read()

The error I get:

UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence

Thus, I tried:

source = urllib2.urlopen(url.encode("utf-8")).read()

That returned a page source, but it differs from what I get with the original URL:

originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source

The result is False. Is there a way to fix this URL? How do I convert u'\xae' back into the original ®?

Solution

A URL must be a valid bytestring, with any non-ASCII code points percent-encoded. Encode the URL to UTF-8, then URL-quote its path:

import urllib
import urllib2
import urlparse

originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
encoded_link = parsed_link.geturl()
source = urllib2.urlopen(encoded_link).read()

Demo:

>>> import urllib
>>> import urllib2 
>>> import urlparse
>>> originalUrl = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
>>> parsed_link = urlparse.urlsplit(originalUrl.encode('utf8'))
>>> parsed_link = parsed_link._replace(path=urllib.quote(parsed_link.path))
>>> encoded_link = parsed_link.geturl()
>>> encoded_link
'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp%C2%AE-75-desktop-virtualization-solutions'
>>> source = urllib2.urlopen(encoded_link).read()
>>> len(source)
68758
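
For readers on Python 3, where urllib2 and urlparse no longer exist and their functions live in urllib.request and urllib.parse, the same idea can be sketched as below. This is not part of the original answer; the helper name quote_url_path is made up for illustration, and the sketch assumes the path has not already been percent-encoded (quoting it twice would turn each % into %25).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode the path component of `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default and leaves '/' unescaped,
    # so only the non-ASCII character in the path is rewritten.
    quoted_path = quote(parts.path)
    return urlunsplit(parts._replace(path=quoted_path))


original_url = ('https://www.packtpub.com/virtualization-and-cloud/'
                'citrix-xenapp\xae-75-desktop-virtualization-solutions')
encoded_link = quote_url_path(original_url)
print(encoded_link)
# The encoded link could then be passed to urllib.request.urlopen().
```

The printed link should match the encoded_link shown in the demo above, since ® (U+00AE) is the two bytes C2 AE in UTF-8.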
