Scraping websites with protected content using PyQt5


Question

I am trying to scrape content from a dynamic website that requires login. I found this piece of code, which works for PyQt4, in the question "Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?":

#!/usr/bin/python
# -*- coding: latin-1 -*-
import sys
import base64
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork

class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(sys.argv)

    username = 'username'
    password = 'password'

    base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
    authheader = "Basic %s" % base64string

    headerKey = QByteArray("Authorization")
    headerValue = QByteArray(authheader)

    url = QUrl(url)
    req = QtNetwork.QNetworkRequest()
    req.setRawHeader(headerKey, headerValue)
    req.setUrl(url)

    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)


    self.mainFrame().load(req)
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

def main():
    url = 'http://www.google.com'
    r = Render(url)
    html = r.frame.toHtml()

How can I translate the same code to work with PyQt5?

Recommended answer

You have to use QWebEnginePage. Its operations are asynchronous, so the HTML is obtained through a callback rather than returned directly. In addition, QtWebEngine does not use QNetworkRequest, so you must use QWebEngineHttpRequest:

import sys

from PyQt5.QtCore import QByteArray, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineHttpRequest
from PyQt5.QtWebEngineWidgets import QWebEnginePage


class Render(QWebEnginePage):
    def __init__(self, url):
        app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.loadFinished.connect(self._loadFinished)

        self._html = ""

        username = "username"
        password = "password"
        base64string = QByteArray(("%s:%s" % (username, password)).encode()).toBase64()
        request = QWebEngineHttpRequest(QUrl.fromUserInput(url))
        # The Basic scheme is "Basic <base64>", with a space and no colon
        request.setHeader(b"Authorization", b"Basic " + base64string.data())

        self.load(request)

        app.exec_()

    @property
    def html(self):
        return self._html

    def _loadFinished(self):
        self.toHtml(self.handle_to_html)

    def handle_to_html(self, html):
        self._html = html
        QApplication.quit()


def main():
    url = "http://www.google.com"
    r = Render(url)
    print(r.html)


if __name__ == "__main__":
    main()
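
The `Authorization` value used above is the word `Basic`, a single space, and the Base64 encoding of `username:password`. As a minimal sketch, that value can be built and checked with the standard library alone before it is handed to `setHeader`. The helper name `basic_auth_header` is hypothetical and is not part of PyQt5; the credentials are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> bytes:
    """Return the value of an HTTP Basic Authorization header as bytes.

    Hypothetical helper for illustration; not part of PyQt5.
    """
    # Credentials are joined with a colon and then Base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    # Scheme name, one space, encoded credentials; no colon after "Basic".
    return b"Basic " + base64.b64encode(credentials)


if __name__ == "__main__":
    print(basic_auth_header("username", "password"))
```

The returned bytes can be passed as the second argument of `request.setHeader(b"Authorization", ...)`. This only covers HTTP Basic authentication; whether a given site accepts it depends on how that site handles login.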
