Scrape multiple URLs using QWebPage

Question

I'm using Qt's QWebPage to render a page that uses javascript to update its content dynamically - so a library that just downloads a static version of the page (such as urllib2) won't work.

My problem is, when I render a second page, about 99% of the time the program just crashes. At other times, it will work three times before crashing. I've also gotten a few segfaults, but it is all very random.

My guess is the object I'm using to render isn't getting deleted properly, so trying to reuse it is possibly causing some problems for myself. I've looked all over and no one really seems to be having this same issue.

Here's the code I'm using. The program downloads web pages from Steam's Community Market so I can create a database of all the items. I need to call the getItemsFromPage function multiple times to get all of the items, as they are broken up into pages (showing results 1-10 out of X amount).

import csv
import re
import sys
from string import replace
from bs4 import BeautifulSoup
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Item:
    __slots__ = ("name", "count", "price", "game")

    def __repr__(self):
        return self.name + "(" + str(self.count) + ")"

    def __str__(self):
        return self.name + ", " + str(self.count) + ", $" + str(self.price)

class Render(QWebPage):  
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
        self.deleteLater()

def getItemsFromPage(appid, page=1):

    r = Render("http://steamcommunity.com/market/search?q=appid:" + str(appid) + "#p" + str(page))

    soup = BeautifulSoup(str(r.frame.toHtml().toUtf8()))

    itemLst = soup.find_all("div", "market_listing_row market_recent_listing_row")

    items = []

    for k in itemLst:
        i = Item()

        i.name = k.find("span", "market_listing_item_name").string
        i.count = int(replace(k.find("span", "market_listing_num_listings_qty").string, ",", ""))
        i.price = float(re.search(r'\$([0-9]+\.[0-9]+)', str(k)).group(1))
        i.game = appid

        items.append(i)

    return items

if __name__ == "__main__":

    print "Updating market items to dota2.csv ..."

    i = 1

    with open("dota2.csv", "w") as f:
        writer = csv.writer(f)

        r = None

        while True:
            print "Page " + str(i)

            items = getItemsFromPage(570)

            if len(items) == 0:
                print "No items found, stopping..."
                break

            for k in items:
                writer.writerow((k.name, k.count, k.price, k.game))

            i += 1

    print "Done."

Calling getItemsFromPage once works fine. Subsequent calls give me my problem. The output of the program is typically

Updating market items to dota2.csv ...
Page 1
Page 2

and then it crashes. It should go on for over 700 pages.

Answer

The problem with your program is that you are attempting to create a new QApplication with every url you fetch.

Instead, you should create one QApplication, and handle all the loading and processing of web pages within the WebPage class itself. The key concept is to use the loadFinished signal to create a loop by fetching a new url after the current one has been loaded and processed.

The two demo scripts below (for PyQt4 and PyQt5) are simplified examples that show how to structure the program. Hopefully, it should be fairly obvious how to adapt them for your own use:

import sys
from PyQt4 import QtCore, QtGui, QtWebKit

class WebPage(QtWebKit.QWebPage):
    def __init__(self):
        super(WebPage, self).__init__()
        self.mainFrame().loadFinished.connect(self.handleLoadFinished)

    def start(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.mainFrame().load(QtCore.QUrl(url))
        return True

    def processCurrentPage(self):
        url = self.mainFrame().url().toString()
        html = self.mainFrame().toHtml()
        # do stuff with html...
        print('loaded: [%d bytes] %s' % (self.bytesReceived(), url))

    def handleLoadFinished(self):
        self.processCurrentPage()
        if not self.fetchNext():
            QtGui.qApp.quit()

if __name__ == '__main__':

    # generate some test urls
    urls = []
    url = 'http://pyqt.sourceforge.net/Docs/PyQt4/%s.html'
    for name in dir(QtWebKit):
        if name.startswith('Q'):
            urls.append(url % name.lower())

    app = QtGui.QApplication(sys.argv)
    webpage = WebPage()
    webpage.start(urls)
    sys.exit(app.exec_())


Here is a PyQt5/QWebEngine version of the above script:

import sys
from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets

class WebPage(QtWebEngineWidgets.QWebEnginePage):
    def __init__(self):
        super(WebPage, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)

    def start(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.load(QtCore.QUrl(url))
        return True

    def processCurrentPage(self, html):
        url = self.url().toString()
        # do stuff with html...
        print('loaded: [%d chars] %s' % (len(html), url))
        if not self.fetchNext():
            QtWidgets.qApp.quit()

    def handleLoadFinished(self):
        self.toHtml(self.processCurrentPage)

if __name__ == '__main__':

    # generate some test urls
    urls = []
    url = 'http://pyqt.sourceforge.net/Docs/PyQt5/%s.html'
    for name in dir(QtWebEngineWidgets):
        if name.startswith('Q'):
            urls.append(url % name.lower())

    app = QtWidgets.QApplication(sys.argv)
    webpage = WebPage()
    webpage.start(urls)
    sys.exit(app.exec_())
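
For reference, here is a rough, untested sketch of how the PyQt4 demo might be adapted to the Steam Community Market scraping from the question. The SteamMarketScraper class name and the stop-when-a-page-has-no-items check are illustrative assumptions, and the "#p<page>" paging scheme is carried over verbatim from the question's URL; since only the fragment changes between pages, the load/loadFinished behaviour may need adjusting for the real site.

import csv
import re
import sys

from bs4 import BeautifulSoup
from PyQt4 import QtCore, QtGui, QtWebKit

class SteamMarketScraper(QtWebKit.QWebPage):
    # One QApplication, one QWebPage; pages are fetched in a loop
    # driven by the loadFinished signal, as in the demo scripts above.
    def __init__(self, appid, writer):
        super(SteamMarketScraper, self).__init__()
        self.appid = appid
        self.writer = writer
        self.page = 0
        self.mainFrame().loadFinished.connect(self.handleLoadFinished)

    def start(self):
        self.fetchNext()

    def fetchNext(self):
        self.page += 1
        print('Page %d' % self.page)
        # paging scheme taken verbatim from the question's URL
        url = ('http://steamcommunity.com/market/search?q=appid:%d#p%d'
               % (self.appid, self.page))
        self.mainFrame().load(QtCore.QUrl(url))

    def handleLoadFinished(self):
        # Parse the rendered HTML with BeautifulSoup, as in the question
        # (QString API v1 / Python 2, matching the question's code)
        html = str(self.mainFrame().toHtml().toUtf8())
        soup = BeautifulSoup(html, 'html.parser')
        rows = soup.find_all('div', 'market_listing_row market_recent_listing_row')
        if not rows:
            print('No items found, stopping...')
            QtGui.qApp.quit()
            return
        for row in rows:
            name = row.find('span', 'market_listing_item_name').string
            count = int(row.find('span',
                'market_listing_num_listings_qty').string.replace(',', ''))
            match = re.search(r'\$([0-9]+\.[0-9]+)', str(row))
            price = float(match.group(1)) if match else 0.0
            self.writer.writerow((name, count, price, self.appid))
        self.fetchNext()

if __name__ == '__main__':

    app = QtGui.QApplication(sys.argv)
    with open('dota2.csv', 'w') as f:
        scraper = SteamMarketScraper(570, csv.writer(f))
        scraper.start()
        app.exec_()
    print('Done.')

The essential difference from the original Render class is that the QApplication is created exactly once, and each subsequent page is requested from inside handleLoadFinished rather than by constructing a new Render object per URL.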
