What is the best way to "get" a web page?


Problem description

I have the following code:


>>> web_page = urllib.urlopen("http://www.python.org")
>>> file = open("temp.html", "w")
>>> web_page_contents = web_page.read()
>>> file.write(web_page_contents)
>>> file.close
<built-in method close of file object at 0xb7cc76e0>
>>>




The file "temp.html" is created, but it doesn't look like the page at
www.python.org. I'm guessing there are multiple frames and my code did
not get everything. Can anyone point me to a tutorial or other
reference on how to "get" all of the html contents at a particular
page?

Why did Python print the line after "file.close"?

Thanks,
Pete

Recommended answer

"Pete" <ha**************@post.com> wrote in message
news:11**********************@i3g2000cwc.googlegroups.com...




A. You didn't actually invoke the close method; you simply referenced it,
which is why you got the output line after file.close. Python is not VB.
To call close, you have to follow it with parentheses, as in:

file.close()

This will have the added benefit of flushing the output to temp.html,
probably containing the missing content you were looking for.
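
As a rough illustration of point A, here is a minimal Python 3 sketch of the same task. The thread's code is Python 2; in Python 3 urlopen lives in urllib.request. The helper name save_page and the scratch file are made up for this example and do not come from the thread. The with-blocks close the response and the file automatically, so a forgotten call to close() cannot leave data unflushed.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen   # Python 3 location of urlopen


def save_page(url, path):
    """Fetch url and write its raw bytes to path; return the byte count.

    save_page is a hypothetical helper written for this illustration.
    """
    with urlopen(url) as response:          # response is closed on exit
        contents = response.read()
    with open(path, "wb") as temp_file:     # "wb" because read() returns bytes
        temp_file.write(contents)
    # Both with-blocks have exited here, so the file is flushed and closed.
    return len(contents)


# Referencing versus calling a method, shown on a scratch file.
scratch = Path(tempfile.mkdtemp()) / "temp.html"
temp_file = open(scratch, "w")
temp_file.write("<html>hello</html>")

reference = temp_file.close                  # a method object; nothing is closed
open_after_reference = not temp_file.closed  # the file is still open

temp_file.close()                            # the call flushes and closes
closed_after_call = temp_file.closed
```

A call such as save_page("http://www.python.org", "temp.html") would then write the page's HTML to temp.html. At the interactive prompt, a bare expression like temp_file.close is echoed as a method object, which is the line Pete saw.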

B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
so masks global names of built-in data types. Try "tempFile" instead.

-- Paul
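
Point B can be illustrated with a small sketch, written for Python 3 (where "file" is no longer a built-in, but "list", "str", "dict" and "int" still are). The function name shadowing_demo is invented for this example. It rebinds the name list inside a function and shows that the built-in type can then no longer be called through that name.

```python
import builtins


def shadowing_demo():
    # Inside this function, "list" now names an ordinary list object.
    list = [1, 2, 3]
    try:
        list("abc")              # no longer the built-in type, so this fails
    except TypeError as exc:
        return str(exc)


error_message = shadowing_demo()

# At module level the name was never rebound, so the built-in still works,
# and it is always reachable through the builtins module.
letters = list("abc")
same_object = builtins.list is list
```

The shadowing is limited to the scope where the name was rebound, which is why it is easy to miss until some later line in that scope needs the built-in.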





> A. You didn't actually invoke the close method, you simply referenced it,
> which is why you got the output line after file.close. Python is not VB.
> To call close, you have to follow it with (), as in:
>
> file.close()




Ahhhh. Thank you very much!


> This will have the added benefit of flushing the output to temp.html,
> probably containing the missing content you were looking for.
>
> B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
> so masks global names of built-in data types. Try "tempFile" instead.




Oh. Thanks again!
The file "temp.html" is definitely different from the first run, but
it is still nothing like www.python.org. Any other suggestions?

Thanks,
Pete


> -- Paul


Pete wrote:

> The file "temp.html" is definitely different than the first run, but
> still not anything close to www.python.org. Any other suggestions?




If you mean that the page looks different in a browser, for one thing
you have to download the CSS files too. Here's the relevant extract
from the main page:

<link media="screen" href="styles/screen-switcher-default.css"
type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="scReen" href="styles/netscape4.css" type="text/css"
rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css"
rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
rel="alternate stylesheet" title="large text" />
<link media="screen" href="styles/defaultfonts.css" type="text/css"
rel="alternate stylesheet" title="default fonts" />

You may either hardcode the URLs of the CSS files, or parse the page,
extract the CSS links and normalize them to absolute URLs. The first is
simpler, but the second is more robust in case a new CSS file is added
or an existing one is renamed or removed.

George
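
The second approach George describes (parse the page, extract the CSS links, normalize them to absolute URLs) might look roughly like the following sketch, using only the Python 3 standard library. The names StylesheetLinkParser and find_stylesheets are invented here, and the sketch assumes the page has no <base href> tag that would change how relative links resolve.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collect absolute URLs from <link> tags whose rel includes 'stylesheet'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "stylesheet" in rel_values and href:
            # Resolve the relative href against the page URL.
            self.stylesheets.append(urljoin(self.base_url, href))


def find_stylesheets(html_text, base_url):
    """Return the absolute stylesheet URLs found in html_text."""
    parser = StylesheetLinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.stylesheets


# Three of the <link> tags quoted in the thread, used as sample input.
sample = '''<link media="screen" href="styles/screen-switcher-default.css"
type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css"
rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
rel="alternate stylesheet" title="large text" />'''

css_urls = find_stylesheets(sample, "http://www.python.org/")
```

Each URL in css_urls could then be fetched and saved the same way as the page itself. Images and scripts referenced by the page would need the same treatment for a saved copy to match what a browser shows.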

