How to save a file from a website as PDF?
Problem description
I'm trying to download PDFs from a job-posting website using IE automation in VBA, but I haven't managed to generate a single PDF.
Doing it manually, by going to the web page and choosing 'Save target as' on the PDF icon, gives me a valid PDF.
The code I have so far (the URLs are public and I picked the offers at random):
Private Declare Function DownloadFilefromURL Lib "urlmon" _
    Alias "URLDownloadToFileA" _
    (ByVal pCaller As Long, _
    ByVal szURL As String, _
    ByVal szFileName As String, _
    ByVal dwReserved As Long, _
    ByVal lpfnCB As Long) As Long

Private Const ERROR_SUCCESS As Long = 0
Private Const BINDF_GETNEWESTVERSION As Long = &H10

Public Function DownloadFile(SourceUrl As String, LocalFile As String) As Boolean
    DownloadFile = DownloadFilefromURL(0&, SourceUrl, LocalFile, BINDF_GETNEWESTVERSION, 0&) = ERROR_SUCCESS
End Function

Sub TestSavePDF()
    Dim oNav As SHDocVw.InternetExplorer
    Dim oDoc As MSHTML.HTMLDocument
    Dim MyURL As String

    Set oNav = New SHDocVw.InternetExplorer
    oNav.Visible = True

    'Test Altays Client A (Banque de France)
    MyURL = "https://www.recrutement.banque-france.fr/detail-offre/?NoSource=16001&NoSociete=167&NoOffre=2036788&NoLangue=1"
    'Test Altays Client B (Egis)
    ' MyURL = "https://www.altays-progiciels.com/clicnjob/FicheOffreCand.php?PageCour=1&Liste=Oui&Autonome=0&NoOffre=2037501&RefOffrel=&NoFaml=0&NoParam1l=0&NoParam2l=0&NoParam3l=0&NoParam133l=0&NoParam134l=0&NoParam136l=0&NoEntite1=0&NoEntite=&NoPaysl=0&NoRegionl=0&NoDepartementl=0&NoTableOffreLieePl=0&NoTableOffreLieeFl=0&NoNivEtl=0&NoTableCCl=0&NoTableCC2l=0&NoTableCC3l=0&NoTableOffreUnl=0&NoTypContratl=0&NoTypContratProl=0&NoStatutOffrel=&NoUtilisateurl=&RechPleinTextel=#ancre3"
    oNav.navigate MyURL

    'Link provided to download the job offer as a PDF. When clicked, the PDF opens in a new tab
    MyURL = "https://www.altays-progiciels.com/clicnjob/ExportPDFFront.php"
    DownloadFile MyURL, "C:\[...Path...]\test.pdf"
End Sub
Shadow DOM and invalid link generation:
On the initial job page, automating a click on the target href doesn't generate a viable page link. This is presumably because the important work actually happens server side.
Target href:
You can click the actual download button on this page
Download button:
This launches a new window, which is why Selenium is great: it has methods to switch to the new window. Otherwise, you can use the FindWindow methods I detail later in the answer to find the Save As window.
In this new window you cannot interact with the buttons the way you normally can when scraping, because the required content is not available via the DOM. If you look closely you will see that the pdf button sits inside a shadow-root, i.e. somewhere you cannot access. This is a design choice. I do need to investigate selecting through the shadow DOM with the '/deep/' combinator at some point, but I don't think it works in VBA.
Download button in Shadow root:
Mimicking keyboard actions:
I am using the Selenium Basic VBA wrapper and Windows APIs to mimic the on-screen actions needed to save the page as a pdf through the Save As window (see image at very bottom), in particular the Save keyboard shortcut sent via SendKeys. This works.
I used Spy++ to inspect the window tree structure and check the window class names and titles.
I use SendKeys to open the Save As dialog for the pdf. I then descend the window tree structure to get handles on the ComboBox where the file name is entered, so I can send it a message (i.e. the file name), and on the Save button, so I can click it. You may need a longer wait to ensure the download goes through correctly. This bit is a little buggy in my opinion and I hope to improve it.
Window Structure via Spy++
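The descent through the dialog's window tree can also be written by matching class names at each level instead of taking the first child. What follows is a minimal sketch, not the method used in this answer: `FindChildWindow`, `FindChildByClass` and `DescendByClasses` are names introduced here for illustration, and the class names in the usage line are assumptions inferred from the variable names in the answer's code, so check them against Spy++ on your own machine.

```vba
Option Explicit

' The alias avoids a name clash with any existing FindWindowExW declaration.
Private Declare PtrSafe Function FindChildWindow Lib "user32" Alias "FindWindowExW" ( _
    ByVal hWndParent As LongPtr, ByVal hWndChildAfter As LongPtr, _
    ByVal lpszClass As LongPtr, ByVal lpszWindow As LongPtr) As LongPtr

' Returns the first direct child of hParent whose window class is className, or 0 if none.
Public Function FindChildByClass(ByVal hParent As LongPtr, ByVal className As String) As LongPtr
    Dim cls As String
    cls = className & vbNullChar
    FindChildByClass = FindChildWindow(hParent, 0, StrPtr(cls), 0)
End Function

' Walks a chain of class names downwards from hRoot.
' Returns 0 if hRoot is 0 or if any level in the chain is missing.
Public Function DescendByClasses(ByVal hRoot As LongPtr, ParamArray classNames() As Variant) As LongPtr
    Dim h As LongPtr, i As Long
    h = hRoot
    For i = LBound(classNames) To UBound(classNames)
        If h = 0 Then Exit For
        h = FindChildByClass(h, CStr(classNames(i)))
    Next i
    DescendByClasses = h
End Function
```

Assuming those class names hold, the five first-child lookups would collapse to one call such as `editHWND = DescendByClasses(ptrSaveWindow, "DUIViewWndClassName", "DirectUIHWND", "FloatNotifySink", "ComboBox", "Edit")`, with a single check for 0 afterwards.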
It is fairly robust. I used Selenium Basic because it makes working with iframes easy and gets round same-origin policy problems. With IE you cannot simply grab the src link of the iframe and happily navigate from the original ad to the page used for the pdf print. What you can do, I believe, is issue an initial XMLHTTP request and grab the src attribute value, i.e. the link, then pass that src link to IE and carry on as shown below for the Windows handling parts.
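A minimal sketch of that XMLHTTP idea, under stated assumptions: the function names `ExtractIframeSrc` and `GetIframeSrc` are made up for illustration, the iframe id `altiframe` is the one used in the Selenium code of this answer, and the iframe is assumed to be present in the raw HTML returned by the server (if the page adds it or sets its src via JavaScript, this will return an empty string). It has not been tested against the site.

```vba
Option Explicit

' Parses htmlText and returns the src attribute of the iframe with the given id ("" if absent).
Public Function ExtractIframeSrc(ByVal htmlText As String, ByVal iframeId As String) As String
    Dim doc As Object, frame As Object
    Set doc = CreateObject("htmlfile")              ' late-bound MSHTML document
    doc.body.innerHTML = htmlText
    Set frame = doc.getElementById(iframeId)
    ' Appending an empty string turns a Null attribute value into "".
    If Not frame Is Nothing Then ExtractIframeSrc = frame.getAttribute("src") & vbNullString
End Function

' Fetches pageUrl with a synchronous GET and extracts the iframe src from the response.
Public Function GetIframeSrc(ByVal pageUrl As String, ByVal iframeId As String) As String
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", pageUrl, False
    http.send
    If http.Status = 200 Then GetIframeSrc = ExtractIframeSrc(http.responseText, iframeId)
End Function
```

A call might look like `src = GetIframeSrc(URL, "altiframe")`, after which `src` would be passed to `oNav.navigate`. A relative src would still need to be resolved against the page URL before navigating.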
With more time I could add the IE version, and I will look at a more robust method than adding explicit wait times for monitoring the file download before quitting the IE instance. Likely along the lines of this (as stated in one of the answers: use SetWindowsHookEx to set up a WH_SHELL hook and look for the HSHELL_WINDOWCREATED event).
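Short of installing a shell hook, a simpler alternative to a fixed wait is to poll for the saved file with a timeout. This is a different technique from the SetWindowsHookEx approach mentioned above, offered only as a sketch: `WaitForFile` is a hypothetical helper, and it assumes the browser only makes the file visible under its final name, with non-zero size, once the download has finished.

```vba
Option Explicit

' Polls until filePath exists with a non-zero size, or until timeoutSec seconds have passed.
' Returns True if the file appeared in time, False otherwise.
Public Function WaitForFile(ByVal filePath As String, ByVal timeoutSec As Single) As Boolean
    Dim startTime As Single, elapsed As Single
    startTime = Timer
    Do
        If Len(Dir$(filePath)) > 0 Then
            If FileLen(filePath) > 0 Then
                WaitForFile = True
                Exit Function
            End If
        End If
        DoEvents                                     ' keep the host application responsive
        elapsed = Timer - startTime
        If elapsed < 0 Then elapsed = elapsed + 86400 ' Timer resets to 0 at midnight
    Loop While elapsed < timeoutSec
End Function
```

In the Selenium routine this could stand in for the fixed `Application.Wait`, for example `If Not WaitForFile(savePath, 30) Then Debug.Print "Download timed out"`, where `savePath` is whatever full path the Save As dialog writes to.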
Notes:
- This is written for 64-bit. For 32-bit, remove PtrSafe. You could switch LongPtr for Long, but I think it remains compatible.
- Huge thanks to @ErikvonAsmuth for his enormous patience in going through the APIs with me. Take a look at his excellent answer here for working with Windows.
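Rather than editing the declarations by hand for each Office version, conditional compilation can select the right form. As far as I know, PtrSafe and LongPtr arrived with VBA7 (Office 2010), so they are available in both 32-bit and 64-bit builds from that version on, and only older hosts need the plain Long declarations. The sketch below shows the pattern for one API; `WindowClassExists` is a hypothetical helper added just to exercise the declaration, and the block is meant as a replacement for an unconditional FindWindowW declare, not an addition to it.

```vba
Option Explicit

#If VBA7 Then
    ' Office 2010 and later, 32-bit or 64-bit.
    Private Declare PtrSafe Function FindWindowW Lib "user32" ( _
        ByVal lpClassName As LongPtr, ByVal lpWindowName As LongPtr) As LongPtr
#Else
    ' Office 2007 and earlier, where PtrSafe and LongPtr do not exist.
    Private Declare Function FindWindowW Lib "user32" ( _
        ByVal lpClassName As Long, ByVal lpWindowName As Long) As Long
#End If

' True if a top-level window of the given class currently exists.
Public Function WindowClassExists(ByVal className As String) As Boolean
    Dim cls As String
    cls = className & vbNullChar
    WindowClassExists = (FindWindowW(StrPtr(cls), 0) <> 0)
End Function
```

Variables that hold handles need the same treatment in pre-VBA7 hosts, i.e. a `#If VBA7 Then` / `#Else` pair around `Dim hWnd As LongPtr` and `Dim hWnd As Long`.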
VBA:
Option Explicit

Declare PtrSafe Function SendMessageW Lib "User32" (ByVal hWnd As LongPtr, ByVal wMsg As Long, ByVal wParam As LongPtr, ByVal lParam As LongPtr) As LongPtr
Declare PtrSafe Function FindWindowExW Lib "User32" (ByVal hWndParent As LongPtr, _
    Optional ByVal hwndChildAfter As LongPtr, Optional ByVal lpszClass As LongPtr, _
    Optional ByVal lpszWindow As LongPtr) As LongPtr
Public Declare PtrSafe Function FindWindowW Lib "User32" (ByVal lpClassName As LongPtr, Optional ByVal lpWindowName As LongPtr) As LongPtr

Public Const WM_SETTEXT = &HC
Public Const BM_CLICK = &HF5

Public Sub GetInfo()
    Dim d As WebDriver, keys As New Selenium.Keys
    Const MAX_WAIT_SEC As Long = 5
    Dim t As Single   'Timer returns a Single (seconds since midnight)
    Set d = New ChromeDriver
    Const URL = "https://www.recrutement.banque-france.fr/detail-offre/charge-de-recrutement-confirme-h-f-2037343/"

    With d
        .start "Chrome"
        .get URL
        'The pdf button lives inside the iframe; clicking it opens the pdf in a new window
        .SwitchToFrame .FindElementById("altiframe")
        .FindElementById("btn-pdf").Click
        .SwitchToNextWindow
        .SendKeys keys.Control, "s"   'open the Save As dialog

        Dim str1 As String, cls As String, name As String
        Dim ptrSaveWindow As LongPtr
        str1 = "#32770" & vbNullChar   'window class of a standard dialog

        'Poll for the Save As dialog until it appears or the timeout expires
        t = Timer
        Do
            DoEvents
            ptrSaveWindow = FindWindowW(StrPtr(str1))
            If Timer - t > MAX_WAIT_SEC Then Exit Do
        Loop While ptrSaveWindow = 0

        Dim duiViewWND As LongPtr, directUIHWND As LongPtr
        Dim floatNotifySinkHWND As LongPtr, comboBoxHWND As LongPtr, editHWND As LongPtr

        'Descend the window tree, first child at each level, down to the file name edit box
        If Not ptrSaveWindow > 0 Then Exit Sub
        duiViewWND = FindWindowExW(ptrSaveWindow, 0&)
        If Not duiViewWND > 0 Then Exit Sub
        directUIHWND = FindWindowExW(duiViewWND, 0&)
        If Not directUIHWND > 0 Then Exit Sub
        floatNotifySinkHWND = FindWindowExW(directUIHWND, 0&)
        If Not floatNotifySinkHWND > 0 Then Exit Sub
        comboBoxHWND = FindWindowExW(floatNotifySinkHWND, 0&)
        If Not comboBoxHWND > 0 Then Exit Sub
        editHWND = FindWindowExW(comboBoxHWND, 0&)
        If Not editHWND > 0 Then Exit Sub

        'Write the file name into the edit box
        Dim msg As String
        msg = "myTest.pdf" & vbNullChar
        SendMessageW editHWND, WM_SETTEXT, 0, StrPtr(msg)
        .SendKeys keys.Control, "s"

        'Find the Save button by class and caption, then click it
        Dim ptrSaveButton As LongPtr
        cls = "Button" & vbNullChar
        name = "&Save" & vbNullChar
        ptrSaveButton = FindWindowExW(ptrSaveWindow, 0, StrPtr(cls), StrPtr(name))
        SendMessageW ptrSaveButton, BM_CLICK, 0, 0

        Application.Wait Now + TimeSerial(0, 0, 4)   'give the download time to finish
        .Quit
    End With
End Sub
Save As Dialog Window:
References:
- Shadow DOM
- Using shadow DOM - Developer Mozilla pages.
- Accessing shadow-root when marked open - selenium
Project references:
- Selenium Type Library