无法使我的脚本异步运行 [英] Unable to make my script work asynchronously

查看:60
本文介绍了无法使我的脚本异步运行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经在vba中编写了一个脚本,以从

因此,该方法仅揭示了问题,但是,最安全,更有效的方法是使用网站API,如此答案 所述.

I've written a script in vba to scrape different movie names and their genre from a torrent site. Although the name and genre are present in it's landing page, I created the script to parse the same going one layer deep (from their main pages). To be clearer, this is one of such page what I meant by main page. My script is parsing them flawlessly. However, my intention is to do the same making asynchronous requests. Currently the script is doing it's job synchronously (in blocking manner).

In my previous post I got an answer from omegastripes who created a script (which more or less performs like how multiprocessing works) meant to be working asynchronously. So that is where I found the idea but can't implement the same within the following script.

My attempt so far:

Sub GetInfo()
    Const URL = "https://yts.am/browse-movies"
    Dim Http As New ServerXMLHTTP60, Html As New HTMLDocument
    Dim post As HTMLDivElement, oName$, oGenre$, R&
    Dim I&, key As Variant, iDic As Object
    Set iDic = CreateObject("Scripting.Dictionary")

    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = .responseText
    End With

    With Html.querySelectorAll(".browse-movie-wrap .browse-movie-title")
        For I = 0 To .Length - 1
            iDic(.Item(I).getAttribute("href")) = 1
        Next I
    End With

    For Each key In iDic.keys
        With Http
            .Open "GET", key, False
            .send
            Html.body.innerHTML = .responseText
        End With

        oName = Html.querySelector("h1").innerText
        oGenre = Html.querySelector("h2").NextSibling.innerText
        R = R + 1: Cells(R, 1) = oName
        Cells(R, 2) = oGenre
    Next key
End Sub

How can I bring about any change within my above script in order for it to be working asynchronously?

解决方案

Here is the example showing single loop parser implementation with async requests pool. The code parses all Browse Pages from the first to last and Movies Pages, both types are parsed simultaneously. Movies URLs are parsed from Browse Pages and placed in Movies Queue, then details from each Movie Page from the queue is parsed and output to the worksheet. It handles all HTTP requests errors types and makes retry until the limit.

Put the below code to a standard module:

Option Explicit

Sub Test()

    Const PoolCapacity = 30 ' Async requests qty
    Const MoviesMin = 55 ' Movies in queue + expected movies min qty to request new browse page
    Const ReqDelayMin = 0.15 ' Min delay between requests to avoid ban, sec
    Const ReqTimeout = 15 ' Request timeout, sec
    Const ReqRetryMax = 3 ' Attempts for each request before quit

    Dim oWS As Worksheet
    Dim y As Long
    Dim ocPool As Collection
    Dim ocMovies As Collection
    Dim lMoviesPerPage As Long
    Dim lBPageIndex As Long
    Dim lBPagesInPoolQty As Long
    Dim bLastBPageReached As Boolean
    Dim dPrevReqSent As Double
    Dim i As Long
    Dim bBPageInPool As Boolean
    Dim dT As Double
    Dim bFail As Boolean
    Dim sResp As String
    Dim oMatches As Object
    Dim oMatch
    Dim oReq As Object
    Dim oRequest As cRequest

    ' Prepare worksheet
    Set oWS = ThisWorkbook.Sheets(1)
    oWS.Cells.Delete
    y = 1
    ' Init
    Set ocPool = New Collection ' Requests async pool
    Set ocMovies = New Collection ' Movies urls queue
    lMoviesPerPage = 20 ' Movies per page qty
    lBPageIndex = 1 ' Current browse page index for request
    bLastBPageReached = False ' Last page reached flag
    dPrevReqSent = -60# * 60# * 24# ' Init delay timer
    ' Start parsing
    Do
        lBPagesInPoolQty = 0 ' How many browse pages currently in pool
        ' Check pool for all flagged and completed requests
        For i = ocPool.Count To 1 Step -1
            bBPageInPool = Not ocPool(i).IsMovie
            ' Delay from last request
            dT = Timer - dPrevReqSent
            If dT < 0 Then dT = dT + 60# * 60# * 24#
            Select Case True
                ' Check request has no sent flag
                Case Not ocPool(i).NeedSend
                    On Error Resume Next
                    bFail = False
                    sResp = ""
                    With ocPool(i).HTTPRequest
                        ' Check http request is ready and status is OK
                        Select Case True
                            Case .ReadyState < 4 ' Not ready
                            Case .Status \ 100 <> 2 ' Wrong status
                                Debug.Print .Status & " / " & .StatusText & " (" & ocPool(i).URL & ")"
                                bFail = True
                            Case Else ' Ready and OK
                                sResp = .ResponseText
                        End Select
                    End With
                    If sResp = "" Then
                        ' Request elapsed time
                        dT = Timer - ocPool(i).SendTimer
                        If dT < 0 Then dT = dT + 60# * 60# * 24#
                        ' Check request is failed
                        Select Case True
                            Case Err.Number <> 0 ' Runtime error
                                Debug.Print Err.Number & " / " & Err.Description & " (" & ocPool(i).URL & ")"
                                bFail = True
                            Case dT > ReqTimeout ' Timeout
                                Debug.Print "Timeout (" & ocPool(i).URL & ")"
                                bFail = True
                        End Select
                        On Error GoTo 0
                        If bFail Then ' Request has been failed
                            ocPool(i).FailsCount = ocPool(i).FailsCount + 1
                            ' Check attempts
                            If ocPool(i).FailsCount > ReqRetryMax Then
                                Debug.Print "Quit (" & ocPool(i).URL & ")"
                                ocPool.Remove i ' Quit
                                bBPageInPool = False
                            Else
                                ocPool(i).NeedSend = True ' Raise send flag to retry
                            End If
                        End If
                    Else ' Response received
                        If ocPool(i).IsMovie Then
                            ' Response from movie page
                            With CreateObject("VBScript.RegExp")
                                ' Parse Title, Year, Genre
                                ' <h1 itemprop\="name">___</h1>\s*<h2>___</h2>\s*<h2>___</h2>
                                .Pattern = "<h1 itemprop\=""name"">([^<]*)</h1>\s*<h2>([^<]*)</h2>\s*<h2>([^<]*)</h2>"
                                Set oMatches = .Execute(sResp)
                                If oMatches.Count = 1 Then ' Output to worksheet
                                    oWS.Cells(y, 1).Value = oMatches(0).SubMatches(0)
                                    oWS.Cells(y, 2).Value = oMatches(0).SubMatches(1)
                                    oWS.Cells(y, 3).Value = oMatches(0).SubMatches(2)
                                    y = y + 1
                                End If
                            End With
                        Else
                            ' Response from browse page
                            With CreateObject("VBScript.RegExp")
                                .Global = True
                                ' Parse movies urls
                                ' <a href="___" class="browse-movie-link">
                                .Pattern = "<a href=""([^""]*)"" class=""browse-movie-link"">"
                                Set oMatches = .Execute(sResp)
                                For Each oMatch In oMatches
                                    ocMovies.Add oMatch.SubMatches(0) ' Movies queue fed
                                Next
                                ' Parse next page button
                                ' <a href="/browse-movies?page=___">Next
                                .Pattern = "<a href\=""/browse-movies\?page\=\d+"">Next "
                                bLastBPageReached = bLastBPageReached Or Not .Test(sResp)
                            End With
                            If Not bLastBPageReached Then lMoviesPerPage = oMatches.Count ' Update lMoviesPerPage
                        End If
                        ocPool.Remove i
                        bBPageInPool = False
                    End If
                ' Check request has send flag raised and delay enough
                Case dT > ReqDelayMin
                    ' Send the request
                    Set oReq = CreateObject("MSXML2.ServerXMLHTTP.6.0")
                    With oReq
                        .Open "GET", ocPool(i).URL, True
                        ' .SetProxy 2, "190.12.55.210:46078"
                        .SetRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"
                        .Send
                    End With
                    ocPool(i).NeedSend = False
                    ocPool(i).SendTimer = Timer
                    dPrevReqSent = ocPool(i).SendTimer
                    Set ocPool(i).HTTPRequest = oReq
            End Select
            If bBPageInPool Then lBPagesInPoolQty = lBPagesInPoolQty + 1
            DoEvents
        Next
        ' Check if there is a room for a new request in pool
        If ocPool.Count < PoolCapacity Then
            ' Add one new request to pool
            ' Check if movies in queue + expected movies are not enough
            If ocMovies.Count + lBPagesInPoolQty * lMoviesPerPage < MoviesMin And Not bLastBPageReached Then
                ' Add new request for next browse page to feed movie queue
                Set oRequest = New cRequest
                With oRequest
                    .URL = "https://yts.am/browse-movies?page=" & lBPageIndex
                    .IsMovie = False
                    .NeedSend = True
                    .FailsCount = 0
                End With
                ocPool.Add oRequest
                lBPageIndex = lBPageIndex + 1
            Else
                ' Check if movie page urls are parsed and available in queue
                If ocMovies.Count > 0 Then
                    ' Add new request for next movie page from queue
                    Set oRequest = New cRequest
                    With oRequest
                        .URL = ocMovies(1)
                        .IsMovie = True
                        .NeedSend = True
                        .FailsCount = 0
                    End With
                    ocPool.Add oRequest
                    ocMovies.Remove 1
                End If
            End If
        End If
        DoEvents
    Loop While ocPool.Count > 0 ' Loop until the last request completed
    MsgBox "Completed"

End Sub

Put the below code to a class module named cRequest:

Public URL As String
Public IsMovie As Boolean
Public NeedSend As Boolean
Public SendTimer As Double
Public HTTPRequest As Object
Public FailsCount As Long

Reduce delay between requests Const ReqDelayMin with care. Once launched with a high rate for me it worked for a while and caused Cloudflare DDoS protection to trigger, and currently, I'm unable to make the code work directly from my IP, the only way is to use a proxy for the requests (you can see the commented line with .SetProxy). Even in Chrome, I'm getting Cloudflare redirection now:

Thus the approach just reveals the question, however, the safest and much more efficient way is to use the website API as described in this answer.

这篇关于无法使我的脚本异步运行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆