Why is Python insisting on using ASCII?

Views: 182 | Published: 2016/8/5 19:11:03 | Tags: python utf-8 ascii beautifulsoup python-requests


Question

When parsing an HTML file with Requests and Beautiful Soup, the following line throws an exception on some web pages:

if 'var' in str(tag.string):

Here is the context:

response = requests.get(url)  
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))

for tag in soup.findAll('script'):
    if 'var' in str(tag.string):    # This is the line throwing the exception
        print(tag.string)

Here is the exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)

I have tried both with and without the encode('utf-8') call in the BeautifulSoup line, and it makes no difference. I do note that on the pages that throw the exception there is a character Ã in a comment in the JavaScript, even though the encoding reported by response.encoding is ISO-8859-1. I realise that I can remove the offending characters with unicodedata.normalize, but I would prefer to convert the tag variable to UTF-8 and keep the characters. None of the following methods helps to change the variable to UTF-8:

tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')

What do I have to do to this string to turn it into usable UTF-8? Thanks!

Recommended answer

OK, so basically you're getting an HTTP response encoded in Latin-1. The character giving you trouble is indeed Ã: in the Latin-1 code table, 0xC3 is exactly that character.
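As a quick check of that claim, here is a minimal sketch written for Python 3 (the thread itself concerns Python 2, so this is an illustration rather than the original poster's code). It decodes the byte 0xC3 as Latin-1, and also shows that the same byte is the first half of a two-byte sequence in UTF-8, which is one common way an Ã ends up in text. All variable names are illustrative.

```python
# Byte 0xC3 decoded as Latin-1 is the single character U+00C3 (A with tilde).
raw = b"\xc3"
as_latin1 = raw.decode("latin-1")

# In UTF-8, 0xC3 is only a lead byte; together with 0xA3 it encodes U+00E3
# (a with tilde). Decoding those two bytes as Latin-1 instead gives two
# characters, the first of which is U+00C3.
two_bytes = b"\xc3\xa3"
as_utf8 = two_bytes.decode("utf-8")
as_mojibake = two_bytes.decode("latin-1")

print(repr(as_latin1), repr(as_utf8), repr(as_mojibake))
```

If the page you are fetching shows Ã followed by another odd symbol, it may be worth checking whether the bytes are really UTF-8 rather than Latin-1.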

I think you blindly tested every combination of decoding and encoding you could think of. First of all, if you do this: if 'var' in str(tag.string): Python will complain whenever the string contains non-ASCII characters, because in Python 2 str() converts implicitly with the ASCII codec.
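The failure can be reproduced in isolation. The sketch below is written for Python 3, where the conversion has to be spelled out, so it calls the ASCII codec explicitly instead of relying on the implicit conversion Python 2 performs. The sample bytes are made up for illustration and are not taken from the original page.

```python
# Hypothetical script content containing the UTF-8 bytes 0xC3 0xA3 in a comment.
data = b"var x = 1; // \xc3\xa3"

message = None
try:
    # The ASCII codec only accepts byte values 0-127, so 0xC3 is rejected.
    data.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The message has the same shape as the one in the question: the codec name, the offending byte, and its position in the input.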

Looking at the code you've shared with us, the right approach, IMHO, would be:

import bs4
import requests

response = requests.get(url)
# Pass the raw bytes and tell Beautiful Soup which encoding to decode them with.
# (from_encoding is ignored when the markup is already a unicode string,
# which is what response.text returns.)
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)

for tag in soup.findAll('script'):
    # soup now holds unicode strings, so compare against a unicode literal;
    # tag.string is None for script tags without inline content
    if tag.string and u'var' in tag.string:
        # encode only at the point where you want UTF-8 bytes for output
        print(tag.string.encode('utf-8'))

Edit: It will be useful for you to take a look at the encoding section of the Beautiful Soup 4 documentation.

Basically, the logic is:

You get some bytes encoded in encoding X.
You decode them by calling bytes.decode('X'), which returns a unicode string.
You work with unicode.
You encode the unicode to some encoding Y for output by calling ustring.encode('Y').
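The steps above can be sketched as a small function. This is a minimal illustration written for Python 3, assuming the input really is in the encoding you name; find_var_lines and the sample data are hypothetical and do not come from the original thread.

```python
def find_var_lines(raw, source_encoding, target_encoding="utf-8"):
    """Hypothetical helper: decode bytes, filter as text, encode for output."""
    text = raw.decode(source_encoding)               # bytes in encoding X -> text
    matches = [line for line in text.splitlines()    # work with text only
               if "var" in line]
    return "\n".join(matches).encode(target_encoding)  # text -> bytes in encoding Y


# Sample input: Latin-1 bytes containing U+00E3, stored as the single byte 0xE3.
latin1_bytes = "var name = 'S\u00e3o Paulo';\nalert(1);".encode("latin-1")
utf8_bytes = find_var_lines(latin1_bytes, "latin-1")
print(utf8_bytes)
```

Decoding once at the input boundary and encoding once at the output boundary keeps every comparison in between free of implicit codec conversions.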

Hope this sheds some light on the problem.
