How to get the real source code of an HTML page?

Question

Whenever I use standard libraries such as urllib2, requests, or pycurl, I don't get the full source code. How can I get the full source code, as I see it in Chrome, Firefox, etc.? I am trying to do it like this:

def go_to(link):
    headers = {'User-Agent': USER_AGENT,
               'Accept': ACCEPT,
               'Accept-Encoding': ACCEPT_ENCODING,
               'Accept-Language': ACCEPT_LANGUAGE,
               'Cache-Control': CACHE_CONTROL,
               'Connection': CONNECTION,
               'Host': HOST}
    req = urllib2.Request(link, None, headers)
    response = urllib2.urlopen(req)
    return response.read()

Thanks!

Sorry for my bad English.

UPD: This is the full code as shown in the browser:

 <td colspan="1"><font class="spy1">1</font> <font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix))</script><font class="spy2">:</font>8088</font></td>

This is the incomplete code returned by my script:

<font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(Eight7FiveSix^Seven1One)+(FiveZeroTwoOne^Two3Zero)+(Eight7FiveSix^Seven1One)+(Eight7FiveSix^Seven1One))</script></font>


Recommended answer

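The port is written into the page by JavaScript (the document.write call), which urllib2 does not execute, so the page has to be rendered before its source is read. One way to do that is to load it in WebKit through PyQt4: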

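import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
  def __init__(self, url):
    # A QApplication must exist before any other Qt object is created.
    self.app = QApplication(sys.argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    # Block here until _loadFinished() stops the event loop.
    self.app.exec_()

  def _loadFinished(self, result):
    # Loading has finished; keep the frame so the caller can read it.
    self.frame = self.mainFrame()
    self.app.quit()

url = 'http://webscraping.com'
r = Render(url)
html = r.frame.toHtml()  # page source after WebKit has loaded it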

Source: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
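To illustrate why a plain HTTP download misses the port, here is a minimal sketch of what the document.write call in the question computes. Each (a^b) pair is an XOR of two JavaScript variables, and the results are concatenated as strings. The variable values below are hypothetical, chosen only so that the result matches the 8088 seen in the browser; the real page defines them in a script that is not shown in the question, and the helper name eval_port is made up for this example.

```python
import re

# Hypothetical values: the real definitions live elsewhere in the page
# and are not shown in the question.
variables = {
    "TwoFiveFiveSix": 124,
    "OneOneSix": 116,
    "Zero0FourFour": 77,
    "ZeroSevenSeven": 77,
}

def eval_port(expr, variables):
    """XOR every (a^b) pair and join the results as text, like document.write."""
    pairs = re.findall(r"\((\w+)\^(\w+)\)", expr)
    return "".join(str(variables[a] ^ variables[b]) for a, b in pairs)

# Expression copied from the browser version of the page in the question.
expr = ("(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)"
        "+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix)")

port = eval_port(expr, variables)
print(port)
```

The variable names differ between the browser copy and the script's copy in the question, which suggests the page may regenerate them on each request. Rendering the page, as the answer does, avoids having to track them.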

UPD: The output type is QString. If you want to convert it to a string, use:

html = r.frame.toHtml().toUtf8().data()
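Once the rendered HTML is available as a plain string, the address can be pulled out with the standard library. The sketch below is one possible approach, not part of the original answer. The helper names visible_text and find_proxies are made up for this example, and the two sample strings are the snippets from the question.

```python
import re

def visible_text(html):
    """Remove <script> blocks, then strip all remaining tags."""
    no_scripts = re.sub(r"(?is)<script\b.*?</script>", "", html)
    return re.sub(r"(?s)<[^>]+>", "", no_scripts)

def find_proxies(html):
    """Return (ip, port) pairs found in the visible text of the page."""
    pattern = r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b"
    return re.findall(pattern, visible_text(html))

# Snippet as seen in the browser (JavaScript has already run).
rendered = r'''<td colspan="1"><font class="spy1">1</font> <font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix))</script><font class="spy2">:</font>8088</font></td>'''

# Snippet as returned by urllib2 (JavaScript has not run).
raw = r'''<font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(Eight7FiveSix^Seven1One)+(FiveZeroTwoOne^Two3Zero)+(Eight7FiveSix^Seven1One)+(Eight7FiveSix^Seven1One))</script></font>'''

print(find_proxies(rendered))
print(find_proxies(raw))
```

The rendered snippet yields the full address, while the raw one yields nothing because the port never appears in the downloaded markup. A regular expression is enough for this fragment; a real HTML parser would be a safer choice for whole pages.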
