Scrape multiple urls using QWebPage

Problem Description

I'm using Qt's QWebPage to render a page that uses JavaScript to update its content dynamically, so a library that only downloads a static version of the page (such as urllib2) won't work.

My problem is that when I render a second page, the program crashes about 99% of the time. Other times it will work three times before crashing. I've also gotten a few segfaults, but it is all very random.

My guess is that the object I'm using to render isn't getting deleted properly, so trying to reuse it is possibly causing problems. I've looked all over and no one else really seems to be having this same issue.

Here's the code I'm using. The program downloads web pages from Steam's community market so I can create a database of all the items. I need to call the getItemsFromPage function multiple times to get all of the items, as they are broken up into pages (showing results 1-10 out of X amount).

import csv
import re
import sys
from string import replace
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Item:
    __slots__ = ("name", "count", "price", "game")

    def __repr__(self):
        return self.name + "(" + str(self.count) + ")"

    def __str__(self):
        return self.name + ", " + str(self.count) + ", $" + str(self.price)

class Render(QWebPage):  
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
        self.deleteLater()

def getItemsFromPage(appid, page=1):

    r = Render("http://steamcommunity.com/market/search?q=appid:" + str(appid) + "#p" + str(page))

    soup = BeautifulSoup(str(r.frame.toHtml().toUtf8()))

    itemLst = soup.find_all("div", "market_listing_row market_recent_listing_row")

    items = []

    for k in itemLst:
        i = Item()

        i.name = k.find("span", "market_listing_item_name").string
        i.count = int(replace(k.find("span", "market_listing_num_listings_qty").string, ",", ""))
        i.price = float(re.search(r'\$([0-9]+\.[0-9]+)', str(k)).group(1))
        i.game = appid

        items.append(i)

    return items

if __name__ == "__main__":

    print "Updating market items to dota2.csv ..."

    i = 1

    with open("dota2.csv", "w") as f:
        writer = csv.writer(f)

        r = None

        while True:
            print "Page " + str(i)

            items = getItemsFromPage(570, i)

            if len(items) == 0:
                print "No items found, stopping..."
                break

            for k in items:
                writer.writerow((k.name, k.count, k.price, k.game))

            i += 1

    print "Done."

Calling getItemsFromPage once works fine; it is the subsequent calls that give me the problem. The output of the program is typically

Updating market items to dota2.csv ...
Page 1
Page 2

and then it crashes. It should go on for over 700 pages.

Recommended Answer

The problem with your program is that you are attempting to create a new QApplication for every url you fetch.
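Only one QApplication may exist per process. As a minimal sketch of that constraint (not the approach taken below), code can check for and reuse an existing instance:

import sys
# PyQt4 shown; for PyQt5 the import is: from PyQt5.QtWidgets import QApplication
from PyQt4.QtGui import QApplication

# Reuse the application object if one already exists, otherwise create it.
app = QApplication.instance()
if app is None:
    app = QApplication(sys.argv)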

Instead, only one QApplication and one WebPage should be created. The WebPage can use its loadFinished signal to create an internal loop, fetching a new url after each one has been processed. Custom html processing can be added by connecting a user-defined slot to a signal that emits the html text and the url when they become available. The scripts below (for PyQt5 and PyQt4) show how to implement this.

Here are some examples which show how to use the WebPage class:

Usage:

def my_html_processor(html, url):
    print('loaded: [%d chars] %s' % (len(html), url))

import sys
app = QApplication(sys.argv)
webpage = WebPage(verbose=False)
webpage.htmlReady.connect(my_html_processor)

# example 1: process list of urls

urls = ['https://en.wikipedia.org/wiki/Special:Random'] * 3
print('Processing list of urls...')
webpage.process(urls)

# example 2: process one url continuously
#
# import signal, itertools
# signal.signal(signal.SIGINT, signal.SIG_DFL)
#
# print('Processing url continuously...')
# print('Press Ctrl+C to quit')
#
# url = 'https://en.wikipedia.org/wiki/Special:Random'
# webpage.process(itertools.repeat(url))

sys.exit(app.exec_())

PyQt5 WebPage:

from PyQt5.QtCore import pyqtSignal, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class WebPage(QWebEnginePage):
    htmlReady = pyqtSignal(str, str)

    def __init__(self, verbose=False):
        super().__init__()
        self._verbose = verbose
        self.loadFinished.connect(self.handleLoadFinished)

    def process(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.load(QUrl(url))
        return True

    def processCurrentPage(self, html):
        self.htmlReady.emit(html, self.url().toString())
        if not self.fetchNext():
            QApplication.instance().quit()

    def handleLoadFinished(self):
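        # Note: QWebEnginePage.toHtml() is asynchronous; it delivers the html
        # to the given callback, so processCurrentPage is passed in here
        # rather than called directly.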
        self.toHtml(self.processCurrentPage)

    def javaScriptConsoleMessage(self, *args, **kwargs):
        if self._verbose:
            super().javaScriptConsoleMessage(*args, **kwargs)

PyQt4 WebPage:

from PyQt4.QtCore import pyqtSignal, QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class WebPage(QWebPage):
    htmlReady = pyqtSignal(str, str)

    def __init__(self, verbose=False):
        super(WebPage, self).__init__()
        self._verbose = verbose
        self.mainFrame().loadFinished.connect(self.handleLoadFinished)

    def process(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.mainFrame().load(QUrl(url))
        return True

    def processCurrentPage(self):
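        # Note: unlike QWebEnginePage.toHtml(), QWebFrame.toHtml() is
        # synchronous and returns the html directly, so no callback is needed.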
        self.htmlReady.emit(
            self.mainFrame().toHtml(), self.mainFrame().url().toString())
        print('loaded: [%d bytes] %s' % (
            self.bytesReceived(), self.mainFrame().url().toString()))

    def handleLoadFinished(self):
        self.processCurrentPage()
        if not self.fetchNext():
            QApplication.instance().quit()

    def javaScriptConsoleMessage(self, *args, **kwargs):
        if self._verbose:
            super(WebPage, self).javaScriptConsoleMessage(*args, **kwargs)
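To connect this back to the question, the html handler can do the BeautifulSoup parsing and CSV writing that getItemsFromPage did. The sketch below is untested and makes some assumptions: it reuses the selectors and the #p<n> fragment pagination from the question's code, hard-codes appid 570, and guesses a fixed page range instead of detecting when the listings run out.

import csv
import re
import sys
from bs4 import BeautifulSoup
# Use the QApplication/WebPage pair from either of the scripts above.

rows = []

def steam_html_processor(html, url):
    # Same selectors as the question's getItemsFromPage.
    soup = BeautifulSoup(html, "html.parser")
    for k in soup.find_all("div", "market_listing_row market_recent_listing_row"):
        name = k.find("span", "market_listing_item_name").string
        count = int(k.find("span", "market_listing_num_listings_qty").string.replace(",", ""))
        price = float(re.search(r'\$([0-9]+\.[0-9]+)', str(k)).group(1))
        rows.append((name, count, price, 570))

app = QApplication(sys.argv)
webpage = WebPage(verbose=False)
webpage.htmlReady.connect(steam_html_processor)

# Assumed fixed page range; the question suggests there are over 700 pages.
urls = ["http://steamcommunity.com/market/search?q=appid:570#p%d" % page
        for page in range(1, 701)]
webpage.process(urls)
app.exec_()

with open("dota2.csv", "w") as f:
    csv.writer(f).writerows(rows)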
