How to get the HTML DOM of a webpage and its frames

Question

I would like to get the DOM of a website after its JavaScript has executed. I would also like to get all the content of the iframes in the page, similar to what Google Chrome's Inspect Element feature shows.

Here is my code:

import sys
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
  def save(self):
    print ("call")
    data = self.webView.page().currentFrame().documentElement().toInnerXml()
    print(data.encode('utf-8'))
    print ('finished')
  def main(self):
    self.webView = QtWebKit.QWebView()
    self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))
    QtCore.QObject.connect(self.webView,QtCore.SIGNAL("loadFinished(bool)"),self.save)

app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())

This gives me the HTML of the page, but not the HTML inside the iframes. Is there any way to get the HTML of the iframes?

Recommended answer

This is a very hard problem to solve in general.

The main difficulty is that there is no way to know in advance how many frames a page has. In addition, each child frame may have its own set of frames, the number of which is also unknown. In theory, there could be an infinite number of nested frames, so the page would never finish loading (which seems no exaggeration for sites that carry a lot of ads).

Anyway, below is a version of your script which gets the QWebFrame object of each frame as it loads, and shows how you can access some of the things you are interested in. As you will see from the output, there are a lot of "junk" frames inserted by ads and the like, which you will somehow need to filter out.

import sys, signal
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
    def save(self, ok, frame=None):
        # Called once for the main frame and once for every child frame.
        if frame is None:
            print('main-frame')
            frame = self.webView.page().mainFrame()
        else:
            print('child-frame')
        print('URL: %s' % frame.baseUrl().toString())
        print('METADATA: %s' % frame.metaData())
        print('TAG: %s' % frame.documentElement().tagName())
        print()

    def handleFrameCreated(self, frame):
        # Report each newly created frame once it has finished loading.
        frame.loadFinished.connect(lambda: self.save(True, frame=frame))

    def main(self):
        self.webView = QtWebKit.QWebView()
        # Connect the signals before loading so that no frame is missed.
        self.webView.page().frameCreated.connect(self.handleFrameCreated)
        self.webView.page().mainFrame().loadFinished.connect(self.save)
        self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))

# Restore the default SIGINT handler so that Ctrl+C terminates the Qt event loop.
signal.signal(signal.SIGINT, signal.SIG_DFL)
print('Press Ctrl+C to quit\n')
app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())
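
The script above only prints a few properties of each frame. To get at the HTML itself, one possible approach is to walk the frame tree recursively once loading has settled. The following is a minimal sketch rather than part of the original answer: `collect_frame_html` and its `max_depth` parameter are hypothetical names, and the function assumes only that each frame object provides `childFrames()`, `toHtml()` and `baseUrl().toString()`, as QWebFrame does. The depth limit is an assumed guard against the unbounded nesting described earlier.

```python
def collect_frame_html(frame, max_depth=5, _depth=0):
    """Return a list of (depth, url, html) tuples for a frame and its descendants.

    `frame` is assumed to behave like a QWebFrame, i.e. it provides
    childFrames(), toHtml() and baseUrl().toString().
    `max_depth` (an illustrative default) stops the walk from descending
    indefinitely into nested frames.
    """
    results = [(_depth, frame.baseUrl().toString(), frame.toHtml())]
    if _depth < max_depth:
        for child in frame.childFrames():
            results.extend(collect_frame_html(child, max_depth, _depth + 1))
    return results
```

With the script above, it could be called as `collect_frame_html(self.webView.page().mainFrame())` from within `save`. Frames that have not finished loading at that moment will yield incomplete HTML, so the call may need to be repeated as further child frames report in.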

NB: it is important that you connect to the loadFinished signal of the main frame rather than that of the web-view. If you connect to the latter, the slot will be called multiple times if the page contains more than one frame.
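
As for filtering out the "junk" frames mentioned above, one simple option is to keep only frames whose URL belongs to a host you care about. This is a rough sketch under that assumption: `is_wanted_frame` and `allowed_hosts` are hypothetical names, and matching on host alone will not suit every site.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

def is_wanted_frame(url, allowed_hosts):
    """Return True if the URL's host is an allowed host or a subdomain of one."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == h or host.endswith('.' + h) for h in allowed_hosts)
```

In the `save` method this could be applied to `str(frame.baseUrl().toString())`, returning early when the check fails. Frames with no host at all, such as `about:blank`, are rejected.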
