How to read HTML from a URL in Python 3


Problem description




I looked at previous similar questions and got only more confused.

In python 3.4, I want to read an html page as a string, given the url.

In perl I do this with LWP::Simple, using get().

A matplotlib 1.3.1 example says: import urllib; u1 = urllib.urlretrieve(url). Python 3 can't find urlretrieve.

I tried u1 = urllib.request.urlopen(url), which appears to get an HTTPResponse object, but I can't print it or get a length on it or index it.

u1.body doesn't exist. I can't find a description of the HTTPResponse in python3.

Is there an attribute in the HTTPResponse object which will give me the raw bytes of the html page?

(Irrelevant stuff from other questions includes urllib2, which doesn't exist in my Python, csv parsers, etc.)

Edit:

I found something in a prior question which partially (mostly) does the job:

import urllib.request

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for line in u2.readlines():
    print(line)

I say 'partially' because I don't want to read separate lines, but just one big string.

I could just concatenate the lines, but every line printed has a character 'b' prepended to it.

Where does that come from?

Again, I suppose I could delete the first character before concatenating, but that does get to be a kludge.

Solution

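Note that Python 3 does not read the HTML as a string but as a bytes object, so you need to convert it to a string with decode. The 'b' in front of each printed line is not part of the data; it is how Python displays a bytes value, and it disappears once the bytes are decoded.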

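import urllib.request

# urlopen() returns an HTTPResponse; the with block closes it when done
with urllib.request.urlopen("http://www.python.org") as fp:
    mybytes = fp.read()  # the whole page as a single bytes object

mystr = mybytes.decode("utf8")  # bytes -> str, assuming the page is UTF-8

print(mystr)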
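
Not every page is UTF-8, so hard-coding the codec can produce garbled text or a UnicodeDecodeError. Here is a minimal sketch of one way to handle that, assuming the server declares its encoding in the Content-Type header. The helper names decode_html and read_html, the 10-second timeout, and the UTF-8 fallback are my own choices for illustration, not part of urllib.

```python
import urllib.request


def decode_html(raw, charset=None, fallback="utf-8"):
    """Turn the bytes of a page into a str.

    Uses the declared charset if there is one, otherwise the fallback.
    Bytes that cannot be decoded are replaced instead of raising.
    """
    return raw.decode(charset or fallback, errors="replace")


def read_html(url, timeout=10):
    """Fetch url and return the whole page as one str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # the entire body as a single bytes object
        # Charset taken from the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
    return decode_html(raw, charset)
```

With this, read_html("http://www.python.org") should give you the page as one big string, with no line-by-line loop and no 'b' prefix. A charset name that Python does not recognise would still raise LookupError, which this sketch does not try to handle.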
