How to get the real source code of an HTML page?
Problem description
Every time I use libraries like urllib2, requests, or pycurl, I don't get the full source code. How can I get the full source code, as I see it in Chrome, Firefox, etc.? I am trying to do it like this:
import urllib2

def go_to(link):
    # Send browser-like headers; the constants are defined elsewhere.
    headers = {'User-Agent': USER_AGENT,
               'Accept': ACCEPT,
               'Accept-Encoding': ACCEPT_ENCODING,
               'Accept-Language': ACCEPT_LANGUAGE,
               'Cache-Control': CACHE_CONTROL,
               'Connection': CONNECTION,
               'Host': HOST}
    req = urllib2.Request(link, None, headers)
    response = urllib2.urlopen(req)
    # Returns the HTML exactly as the server sent it; no JavaScript is run.
    return response.read()
Thanks! Sorry for my bad English.
UPD: This is the full code from the browser:
<td colspan="1"><font class="spy1">1</font> <font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix))</script><font class="spy2">:</font>8088</font></td>
This is the incomplete code from my script:
<font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(Eight7FiveSix^Seven1One)+(FiveZeroTwoOne^Two3Zero)+(Eight7FiveSix^Seven1One)+(Eight7FiveSix^Seven1One))</script></font>
Recommended answer
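The part missing from the script's output (the port after the IP address) is written by JavaScript via document.write, so an HTTP client that does not execute scripts never sees it. The best solution is to load the page in a WebKit engine and read the HTML after it has rendered: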
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Block here until _loadFinished() quits the event loop.
        self.app.exec_()

    def _loadFinished(self, result):
        # The page has finished loading; keep the frame for later use.
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://webscraping.com'
r = Render(url)
# HTML of the rendered page, including markup written by JavaScript.
html = r.frame.toHtml()
Source: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
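Before reaching for a full rendering engine, it can help to confirm that the missing markup really is produced by JavaScript. The following is a minimal sketch for Python 3 using only the standard library; the names `ScriptCollector` and `find_document_writes` are made up for illustration, and the sample string is a shortened copy of the snippet from the question. It lists the inline scripts in the raw HTML that call `document.write`:

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text of every inline <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def find_document_writes(html):
    """Return the inline scripts that call document.write (hypothetical helper)."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [s for s in collector.scripts if "document.write" in s]


# Shortened version of the HTML that the script in the question received.
raw = ('<font class="spy14">192.3.10.113<script type="text/javascript">'
       'document.write("<font class=spy2>:<\\/font>"+(Eight7FiveSix^Seven1One))'
       '</script></font>')
hits = find_document_writes(raw)
print(len(hits))
```

If this finds scripts around the places where content is missing, the page is being completed in the browser, and only something that executes JavaScript will reproduce what Chrome or Firefox shows.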
UPD: The type of the output is QString. If you want to convert it to a string, use:
html = r.frame.toHtml().toUtf8().data()
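For this particular page there may also be a lighter alternative to rendering. This is not the method from the answer above, only a sketch of a different technique: the inline script concatenates the results of several `(A^B)` XOR expressions into the port string, so the same arithmetic can be repeated in Python. It assumes the variables are plain integers defined somewhere else in the page, which the snippets in the question do not show; the helper name `decode_port` and the values in `variables` are hypothetical, chosen only so that the result matches the `8088` seen in the browser.

```python
import re


def decode_port(script, variables):
    """Evaluate every (A^B) pair in the script and join the results as text.

    This mirrors JavaScript's string concatenation: "text" + 8 + 0 -> "text80".
    """
    pairs = re.findall(r"\((\w+)\^(\w+)\)", script)
    return "".join(str(variables[a] ^ variables[b]) for a, b in pairs)


# Inline script taken from the browser version of the page in the question.
script = ('document.write("<font class=spy2>:<\\/font>"'
          '+(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)'
          '+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix))')

# Hypothetical values: the real ones would have to be extracted from the page.
variables = {
    "TwoFiveFiveSix": 12,
    "OneOneSix": 4,       # 12 ^ 4 == 8
    "Zero0FourFour": 7,
    "ZeroSevenSeven": 7,  # 7 ^ 7 == 0
}

port = decode_port(script, variables)
print(port)  # 8088
```

The variable names differ between the two responses in the question, so the definitions would have to be read from the same response as the script that uses them. Rendering the page, as in the answer, avoids that bookkeeping at the cost of a heavier dependency.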