How to save a file from a website as PDF?


Problem description


    I'm trying to download PDFs from a website (job postings) using IE automation in VBA, but I haven't managed to generate a single PDF.

    Doing it manually, by going to the web page and choosing 'Save target as' on the PDF icon, gives me a valid PDF.

    The code I have so far (the URLs are public and I picked offers at random):

    Private Declare Function DownloadFilefromURL Lib "urlmon" _
    Alias "URLDownloadToFileA" _
    (ByVal pCaller As Long, _
    ByVal szURL As String, _
    ByVal szFileName As String, _
    ByVal dwReserved As Long, _
    ByVal lpfnCB As Long) As Long
    
    Private Const ERROR_SUCCESS As Long = 0
    Private Const BINDF_GETNEWESTVERSION As Long = &H10
    
    
    Public Function DownloadFile(SourceUrl As String, LocalFile As String) As Boolean
        DownloadFile = DownloadFilefromURL(0&, SourceUrl, LocalFile, BINDF_GETNEWESTVERSION, 0&) = ERROR_SUCCESS
    End Function
    
    
    Sub TestSavePDF()
        Dim oNav As SHDocVw.InternetExplorer
        Dim oDoc As MSHTML.HTMLDocument
        Dim MyURL As String
    
        Set oNav = New SHDocVw.InternetExplorer
        oNav.Visible = True
        'Test Altays Client A (Banque de France)
        MyURL = "https://www.recrutement.banque-france.fr/detail-offre/?NoSource=16001&NoSociete=167&NoOffre=2036788&NoLangue=1"
        'Test Altays Client B (Egis)
        '        MyURL = "https://www.altays-progiciels.com/clicnjob/FicheOffreCand.php?PageCour=1&Liste=Oui&Autonome=0&NoOffre=2037501&RefOffrel=&NoFaml=0&NoParam1l=0&NoParam2l=0&NoParam3l=0&NoParam133l=0&NoParam134l=0&NoParam136l=0&NoEntite1=0&NoEntite=&NoPaysl=0&NoRegionl=0&NoDepartementl=0&NoTableOffreLieePl=0&NoTableOffreLieeFl=0&NoNivEtl=0&NoTableCCl=0&NoTableCC2l=0&NoTableCC3l=0&NoTableOffreUnl=0&NoTypContratl=0&NoTypContratProl=0&NoStatutOffrel=&NoUtilisateurl=&RechPleinTextel=#ancre3"
    
        
        oNav.navigate MyURL
        'link provided to download the job offer in PDF. when clicked the PDF opens in a new tab
        MyURL = "https://www.altays-progiciels.com/clicnjob/ExportPDFFront.php"
    
        DownloadFile MyURL, "C:\[...Path...]\test.pdf"
        
    End Sub
    

    Solution

    Shadow DOM and invalid link generation:

    On the initial job page, automating a click on the target href doesn't generate a viable page link. This is presumably because the important work actually happens server-side.

    Target href:

    You can click the actual download button on this page.

    Download button:

    This launches a new window, which is why Selenium is great: it has methods to switch to this new window. Otherwise, you can use the FindWindow methods I detail later in the answer for finding the Save As window.

    In this new window you cannot interact with the buttons the way you normally can when scraping, because the required content is not available via the DOM. If you look closely you will see that the PDF button sits inside a shadow root, which ordinary DOM selectors cannot reach. This is a design choice of the page. At some point I need to investigate selecting through the shadow DOM with the '/deep/' combinator, but I don't think that works from VBA.

    Download button in Shadow root:


    Mimicking keyboard actions:

    I am using the Selenium Basic VBA wrapper and Windows APIs to mimic the on-screen actions needed to save the PDF through the Save As window (see the image at the very bottom), in particular the Save keyboard shortcut sent via SendKeys. This works. I used Spy++ to inspect the window tree structure and to check window class names and titles.

    I use SendKeys to open the Save As dialog for the PDF. I then descend the window tree to get handles on the ComboBox where the file name is entered, so I can send it a message containing the file name, and on the Save button, so I can click it. You may need a longer wait to ensure the download completes. This part is a little buggy in my opinion and I hope to improve it.

    Window Structure via Spy++

    The approach is fairly robust. I used Selenium Basic because it makes working with iframes easy and gets round same-origin policy problems. With IE you cannot simply grab the src link of the iframe and navigate from the original ad to the PDF print page. What you can do, I believe, is issue an initial XMLHTTP request and grab the iframe's src attribute value, i.e. the link, then pass that link to IE and carry on as shown below for the window-handling parts.
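
    The XMLHTTP idea above could be sketched roughly as follows. This is not part of the original answer and is untested against the site: GetFrameSrc is a hypothetical helper name, the project references "Microsoft XML, v6.0" and "Microsoft HTML Object Library" are assumed, and it only works if the iframe is present in the HTML the server returns rather than being injected by script.

    ```vba
    Option Explicit

    ' Assumed project references: "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    ' GetFrameSrc is a hypothetical helper, not part of the original answer.
    Public Function GetFrameSrc(ByVal pageUrl As String, ByVal frameId As String) As String
        Dim http As MSXML2.XMLHTTP60
        Dim html As MSHTML.HTMLDocument
        Dim frame As Object
        Dim src As Variant

        ' Synchronous GET of the job page (third argument False = not asynchronous)
        Set http = New MSXML2.XMLHTTP60
        http.Open "GET", pageUrl, False
        http.send
        If http.Status <> 200 Then Exit Function   ' returns "" on failure

        ' Parse the response into an in-memory document; page scripts are not run
        Set html = New MSHTML.HTMLDocument
        html.body.innerHTML = http.responseText

        ' Look up the iframe by id and read its src attribute, if there is one
        Set frame = html.getElementById(frameId)
        If frame Is Nothing Then Exit Function
        src = frame.getAttribute("src")
        If Not IsNull(src) Then GetFrameSrc = CStr(src)
    End Function
    ```

    With the id used in the code further down, a call would look like GetFrameSrc(URL, "altiframe"). An empty string means the request failed or the iframe was not found in the raw HTML. A relative src may come back resolved against the in-memory document rather than the site, so check the value before passing it to IE.
    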

    With more time I could add an IE version, and I will look at a more robust way than an explicit wait of monitoring for the file download before quitting the IE instance, likely along these lines (as stated in one of the answers): use SetWindowsHookEx to set up a WH_SHELL hook and look for the HSHELL_WINDOWCREATED event.
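
    As a simpler stand-in for the fixed wait (this is plain polling, not the WH_SHELL hook mentioned above), one could poll for the saved file with a timeout. A minimal sketch, assuming you know the full path the Save As dialog writes to; WaitForFile and its parameters are hypothetical names:

    ```vba
    Option Explicit

    ' Hypothetical helper: returns True once filePath exists, False if timeoutSec elapses first.
    Public Function WaitForFile(ByVal filePath As String, ByVal timeoutSec As Double) As Boolean
        Dim startTime As Double
        Dim elapsed As Double

        startTime = Timer   ' seconds since midnight
        Do
            DoEvents        ' keep the host application responsive while polling
            If Len(Dir$(filePath)) > 0 Then
                WaitForFile = True
                Exit Function
            End If
            elapsed = Timer - startTime
            If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer wrapped past midnight
        Loop While elapsed < timeoutSec
    End Function
    ```

    In the code further down, the Application.Wait line could then become a check such as If Not WaitForFile(fullPath, 30) Then ..., where fullPath is the dialog's target folder plus "myTest.pdf". The file appearing does not strictly prove the browser has finished writing it, so a short extra pause before .Quit may still be prudent.
    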


    Notes:

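    1. This is written for 64-bit Office. On hosts that predate VBA7 (Office 2007 and earlier), remove PtrSafe and switch LongPtr to Long. VBA7 hosts accept PtrSafe and LongPtr on both 32-bit and 64-bit, so the code should remain compatible there.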
    2. Huge thanks to @ErikvonAsmuth for his enormous patience in going through the APIs with me. Take a look at his excellent answer here for working with Windows.
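
    Regarding note 1, a common way to keep one code base working on both old and new hosts is conditional compilation on the built-in VBA7 constant. A sketch for one of the declarations used below (the others would follow the same pattern):

    ```vba
    #If VBA7 Then
        ' VBA7 (Office 2010 and later, 32-bit or 64-bit): PtrSafe and LongPtr are available
        Public Declare PtrSafe Function FindWindowW Lib "User32" ( _
            ByVal lpClassName As LongPtr, _
            Optional ByVal lpWindowName As LongPtr) As LongPtr
    #Else
        ' Pre-VBA7 hosts: no PtrSafe keyword; handles and pointers are 32-bit Long
        Public Declare Function FindWindowW Lib "User32" ( _
            ByVal lpClassName As Long, _
            Optional ByVal lpWindowName As Long) As Long
    #End If
    ```

    Variables that hold window handles (declared As LongPtr in the code below) would need the same #If VBA7 treatment on pre-VBA7 hosts.
    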


    VBA:

    Option Explicit
    
    Declare PtrSafe Function SendMessageW Lib "User32" (ByVal hWnd As LongPtr, ByVal wMsg As LongPtr, ByVal wParam As LongPtr, ByVal lParam As LongPtr) As LongPtr
     
    Declare PtrSafe Function FindWindowExW Lib "User32" (ByVal hWndParent As LongPtr, _
                                                         Optional ByVal hwndChildAfter As LongPtr, Optional ByVal lpszClass As LongPtr, _
                                                         Optional ByVal lpszWindow As LongPtr) As LongPtr
                                                     
    Public Declare PtrSafe Function FindWindowW Lib "User32" (ByVal lpClassName As LongPtr, Optional ByVal lpWindowName As LongPtr) As LongPtr
    
    Public Const WM_SETTEXT = &HC
    Public Const BM_CLICK = &HF5
    
    Public Sub GetInfo()
        Dim d As WebDriver, keys As New Selenium.keys
        Const MAX_WAIT_SEC As Long = 5
        Dim t As Single   ' Timer returns seconds since midnight as a Single
        
        Set d = New ChromeDriver
        Const URL = "https://www.recrutement.banque-france.fr/detail-offre/charge-de-recrutement-confirme-h-f-2037343/"
        With d
            .start "Chrome"
            .get URL
            .SwitchToFrame .FindElementById("altiframe")
            .FindElementById("btn-pdf").Click
            .SwitchToNextWindow
            .SendKeys keys.Control, "s"
            
            Dim str1 As String, cls As String, name As String
            Dim ptrSaveWindow As LongPtr
            
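            ' "#32770" is the window class of standard dialog boxes such as Save As
            str1 = "#32770" & vbNullChar
            
            ' Poll for the dialog until it appears or MAX_WAIT_SEC has elapsed
            t = Timer
            Do
                DoEvents
                ptrSaveWindow = FindWindowW(StrPtr(str1))
                If Timer - t > MAX_WAIT_SEC Then Exit Do
            Loop While ptrSaveWindow = 0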
                 
            Dim duiViewWND As LongPtr, directUIHWND As LongPtr
            Dim floatNotifySinkHWND As LongPtr, comboBoxHWND As LongPtr, editHWND As LongPtr
    
    
            If Not ptrSaveWindow > 0 Then Exit Sub
            
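            ' Descend the window tree one first child at a time (structure found with Spy++)
            ' until the Edit control that holds the file name is reached
            duiViewWND = FindWindowExW(ptrSaveWindow, 0&)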
            
            If Not duiViewWND > 0 Then Exit Sub
            
            directUIHWND = FindWindowExW(duiViewWND, 0&)
            
            If Not directUIHWND > 0 Then Exit Sub
            
            floatNotifySinkHWND = FindWindowExW(directUIHWND, 0&)
            
            If Not floatNotifySinkHWND > 0 Then Exit Sub
            
            comboBoxHWND = FindWindowExW(floatNotifySinkHWND, 0&)
    
            If Not comboBoxHWND > 0 Then Exit Sub
            
            editHWND = FindWindowExW(comboBoxHWND, 0&)
            
            If Not editHWND > 0 Then Exit Sub
            
            Dim msg As String
            msg = "myTest.pdf" & vbNullChar
            
            ' Write the file name into the Edit control
            SendMessageW editHWND, WM_SETTEXT, 0, StrPtr(msg)
    
            .SendKeys keys.Control, "s"
            
            ' Locate the dialog's Save button by class and caption, then click it
            Dim ptrSaveButton As LongPtr
            cls = "Button" & vbNullChar
            name = "&Save" & vbNullChar
    
            ptrSaveButton = FindWindowExW(ptrSaveWindow, 0, StrPtr(cls), StrPtr(name))
          
            SendMessageW ptrSaveButton, BM_CLICK, 0, 0
                 
            Application.Wait Now + TimeSerial(0, 0, 4)
            
            .Quit
        End With
    End Sub
    


    Save As Dialog Window:


    References:

    1. Shadow DOM
    2. Using shadow DOM - Developer Mozilla pages.
    3. Accessing shadow-root when marked open - selenium


    Project references:

    1. Selenium Type Library
