Why does b' (and sometimes b' ') show up when I split some HTML source? [Python]


Question

I'm fairly new to Python and to programming in general. I have done a few tutorials and am about 2/3 of the way through a pretty good book. I've been trying to get more comfortable with Python and programming by just trying out things in the standard library.

I recently ran into a weird quirk that I'm sure is the result of my own incorrect or un-"pythonic" use of the urllib module (with Python 3.2.2):

import urllib.request

# Placeholder URL; urlopen() needs a quoted string that includes the scheme.
HTML_source = urllib.request.urlopen('http://www.somelink.com').read()
print(HTML_source)

When I run this in the interactive interpreter it prints the HTML source of somelink, but prefixed with b', for example:

b'<HTML>\r\n<HEAD> (etc). . . .

If I split the string into a list on whitespace, every item in the list is prefixed with b' as well.

I'm not trying to accomplish anything specific, just trying to familiarize myself with the standard library. I would like to know why this b' is being prefixed.

Bonus question: is there a better way to get HTML source WITHOUT using a third-party module? I know all that jazz about not reinventing the wheel, but I'm trying to learn by "building my own tools".

Thanks in advance!

Recommended answer

The "b" prefix means that the type is bytes, not str. It is part of how Python displays a bytes object, not part of the downloaded data. To convert the bytes into text, use the decode method and name the appropriate encoding. The encoding is often found in the "Content-Type" header:

>>> u = urllib.request.urlopen('http://cnn.com')
>>> u.getheader('Content-Type')
'text/html; charset=UTF-8'
>>> html = u.read().decode('utf-8')
>>> type(html)
<class 'str'>
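
To tie this back to the splitting behaviour in the question, here is a small sketch using made-up sample bytes (not a real download). It assumes the data is UTF-8; splitting bytes yields a list of bytes objects, each displayed with its own b prefix, while decoding first yields ordinary strings.

```python
# Made-up sample standing in for the result of urlopen(...).read()
raw = b'<HTML>\r\n<HEAD>\r\n<TITLE>Example</TITLE>'

# Splitting bytes gives a list of bytes objects (each shown with a b prefix).
byte_parts = raw.split()

# Decoding first (UTF-8 is assumed here) gives a list of plain str objects.
text_parts = raw.decode('utf-8').split()

print(byte_parts)
print(text_parts)
```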

If the headers don't name an encoding, try utf-8 as a default.
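
One possible way to combine the header lookup with the utf-8 fallback is sketched below. The helper name decode_body and its parameters are hypothetical, introduced only for illustration; it uses the standard library's email.message.Message to parse the charset out of a Content-Type value, and it runs without a network connection.

```python
from email.message import Message


def decode_body(raw, content_type, default='utf-8'):
    """Decode raw bytes using the charset named in a Content-Type value.

    decode_body is a hypothetical helper, not part of urllib.
    Falls back to `default` when no charset is given.
    """
    msg = Message()
    if content_type:
        msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or default
    return raw.decode(charset)


print(decode_body(b'caf\xc3\xa9', 'text/html; charset=UTF-8'))
print(decode_body(b'caf\xe9', 'text/html; charset=ISO-8859-1'))
print(decode_body(b'caf\xc3\xa9', 'text/html'))  # no charset, so utf-8 is assumed
```

With a real response, the same lookup should be available directly as u.headers.get_content_charset(), since the response headers object is a Message subclass. Note that decoding raises UnicodeDecodeError if the assumed default turns out to be wrong for the page.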
