Scraping websites with protected content using PyQt5
Question
I am trying to scrape content from a dynamic website that requires login. I found this piece of code, which works for PyQt4, in the question "Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?":
#!/usr/bin/python
# -*- coding: latin-1 -*-
import sys
import base64

from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork


class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        username = 'username'
        password = 'password'
        base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
        authheader = "Basic %s" % base64string
        headerKey = QByteArray("Authorization")
        headerValue = QByteArray(authheader)
        url = QUrl(url)
        req = QtNetwork.QNetworkRequest()
        req.setRawHeader(headerKey, headerValue)
        req.setUrl(url)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(req)
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


def main():
    url = 'http://www.google.com'
    r = Render(url)
    html = r.frame.toHtml()
How can I translate the same code to work with PyQt5?
Recommended answer
You have to use QWebEnginePage, which makes the task asynchronous: the HTML is not returned directly but delivered to a callback. Also, QtWebEngine does not use QNetworkRequest, so you must use QWebEngineHttpRequest:
import sys

from PyQt5.QtCore import QByteArray, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineCore import QWebEngineHttpRequest
from PyQt5.QtWebEngineWidgets import QWebEnginePage


class Render(QWebEnginePage):
    def __init__(self, url):
        # A QApplication must exist before the page is created.
        app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self._html = ""
        username = "username"
        password = "password"
        # Base64-encode "username:password"; data() returns Python bytes.
        base64string = QByteArray(("%s:%s" % (username, password)).encode()).toBase64().data()
        request = QWebEngineHttpRequest(QUrl.fromUserInput(url))
        # HTTP Basic scheme: "Basic <credentials>", with no colon after "Basic".
        request.setHeader(b"Authorization", b"Basic %s" % (base64string,))
        self.load(request)
        # Block here until handle_to_html() quits the event loop.
        app.exec_()

    @property
    def html(self):
        return self._html

    def _loadFinished(self):
        # toHtml() is asynchronous: the HTML is delivered to the callback.
        self.toHtml(self.handle_to_html)

    def handle_to_html(self, html):
        self._html = html
        QApplication.quit()


def main():
    url = "http://www.google.com"
    r = Render(url)
    print(r.html)


if __name__ == "__main__":
    main()
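
A minimal sketch of how the Authorization header value can be built and checked without Qt, using only the standard library. The helper name basic_auth_header is hypothetical and not part of the original answer. It assumes the server expects the HTTP Basic scheme, where the value is the word "Basic", a single space, and the Base64 encoding of "username:password". Note that base64.encodestring, used in the PyQt4 code from the question, was removed in Python 3.9, so base64.b64encode is used here instead.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value as bytes (hypothetical helper)."""
    # Join the credentials with a colon and encode them to bytes.
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    # b64encode returns bytes without a trailing newline.
    token = base64.b64encode(credentials)
    # Scheme name, one space, then the Base64 token.
    return b"Basic " + token


if __name__ == "__main__":
    print(basic_auth_header("username", "password"))
```

The returned bytes could be passed as the second argument of request.setHeader(b"Authorization", ...) in the Render class, since PyQt5 accepts bytes where a QByteArray is expected.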