Print code from a web page with Python and urllib


Question

I'm trying to use Python and urllib to look at the source code of a certain web page. The following code has worked for me on other web pages:

from urllib import *
url = 
code = urlopen(url).read()
print code

But for this page it returns nothing at all. My guess is that this is because the page uses a lot of JavaScript. What can I do?

Recommended answer

Dynamically generated client-side pages (JavaScript)

You cannot use urllib alone to see content that is rendered dynamically on the client side by JavaScript. The reason is that urllib only fetches the response from the server, which consists of the headers and the body (the actual code). It does not execute any client-side code, so anything a script would add to the page never appears in what you read.
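To make this concrete, here is a minimal Python 3 sketch (the snippets in this thread are Python 2, so `urllib.request` stands in for the old `urllib`). The page, the `PageHandler` class, and the `fetch` helper are all made up for illustration: a throwaway local server returns a page whose visible text would only be filled in by a script, and the fetch shows that the body arrives exactly as sent, with the `div` still empty.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page: the div only gets its text when a browser runs the script.
PAGE = b"""<html><body>
<div id="content"></div>
<script>document.getElementById("content").textContent = "Rendered by JavaScript";</script>
</body></html>"""


class PageHandler(BaseHTTPRequestHandler):
    """Throwaway local server used only for this demonstration."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Return (status, headers, body) exactly as the server sent them."""
    # Works like urlopen, but ignores proxy settings so the local server is reachable.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        headers = dict(response.headers.items())
        return response.status, headers, response.read().decode("utf-8")


server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

status, headers, body = fetch("http://127.0.0.1:%d/" % server.server_port)
server.shutdown()
server.server_close()

print(status)
print(headers["Content-Type"])
print(body)  # the script is present as text, but it was never executed
```

The printed body still contains the empty `<div id="content"></div>` followed by the unexecuted `<script>` element; only a real browser engine would turn that into the rendered text.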

You can, however, use something like selenium to remote-control a web browser (Chrome or Firefox). That makes it possible to scrape the page even though it is rendered with JavaScript.

Here is a sample of scraping with selenium: Using python with selenium to scrape dynamic web pages
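As a rough sketch of what the selenium approach can look like (this is not the linked sample): it assumes Python 3, the `selenium` package, and a Chrome installation that selenium can drive. The function name `rendered_source` is made up for illustration.

```python
def rendered_source(url):
    """Load url in headless Chrome and return the page source after scripts have run."""
    # Imported here so this file still loads when selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # the browser loads the page and executes its JavaScript
        return driver.page_source  # the DOM as the browser currently sees it
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling `print(rendered_source(url))` would then print the rendered page. Pages that keep loading content after the initial load may need an explicit wait before reading `page_source`.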

The problem with this site, however, seems to be that they don't want to be scraped. They block clients that send certain HTTP User-Agent headers.

You can still get the code if you fake the HTTP headers. Use urllib2 instead of urllib, like this:

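import urllib2  # Python 2 only; Python 3 provides the same classes in urllib.request

req = urllib2.Request(url)  # url: the page address, as in the question
# Present a browser-like User-Agent instead of the default Python one
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox')
response = urllib2.urlopen(req)
print response.read()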
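For readers on Python 3, where `urllib2` no longer exists, the same idea can be sketched with `urllib.request`. Since the real site is not named here, the example runs against a stand-in local server (`PickyHandler`, an assumption for illustration) that answers 403 unless the User-Agent looks like a browser; the `fetch` helper is likewise hypothetical.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) "
              "Gecko/20091102 Firefox")


class PickyHandler(BaseHTTPRequestHandler):
    """Stand-in server (an assumption): rejects clients that do not look like a browser."""

    def do_GET(self):
        allowed = self.headers.get("User-Agent", "").startswith("Mozilla/")
        body = b"<html>page source</html>" if allowed else b"blocked"
        self.send_response(200 if allowed else 403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url, user_agent=None):
    """Return (status, body); send a custom User-Agent only if one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    # Works like urlopen, but ignores proxy settings so the local server is reachable.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the error object still carries code and body
        return error.code, error.read().decode("utf-8")


server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

default_result = fetch(url)             # sent with urllib's own Python-urllib User-Agent
browser_result = fetch(url, BROWSER_UA)  # sent with the browser-like User-Agent

server.shutdown()
server.server_close()

print(default_result)
print(browser_result)
```

Against this stand-in server, the default request is refused and the one carrying the browser-like header gets the page. A real site may of course use other checks besides the User-Agent.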

But they clearly don't want you to scrape their site, so you should consider whether this is a good idea.
