How to save a file from a website as PDF?
I'm trying to download PDFs from a website (job posting) thanks to a IE automation in VBA. I don't manage to generate a single PDF.
Doing it manually by going on the web page and doing a 'save target as' on the pdf icon gives me a valid PDF.
The code I have so far (the URLs are public and I've picked up offers at random).
Private Declare Function DownloadFilefromURL Lib "urlmon" _
Alias "URLDownloadToFileA" _
(ByVal pCaller As Long, _
ByVal szURL As String, _
ByVal szFileName As String, _
ByVal dwReserved As Long, _
ByVal lpfnCB As Long) As Long
Private Const ERROR_SUCCESS As Long = 0
Private Const BINDF_GETNEWESTVERSION As Long = &H10
Public Function DownloadFile(SourceUrl As String, LocalFile As String) As Boolean
DownloadFile = DownloadFilefromURL(0&, SourceUrl, LocalFile, BINDF_GETNEWESTVERSION, 0&) = ERROR_SUCCESS
End Function
Sub TestSavePDF()
Dim oNav As SHDocVw.InternetExplorer
Dim oDoc As MSHTML.HTMLDocument
Dim MyURL As String
Set oNav = New SHDocVw.InternetExplorer
oNav.Visible = True
'Test Altays Client A (Banque de France)
MyURL = "https://www.recrutement.banque-france.fr/detail-offre/?NoSource=16001&NoSociete=167&NoOffre=2036788&NoLangue=1"
'Test Altays Client B (Egis)
' MyURL = "https://www.altays-progiciels.com/clicnjob/FicheOffreCand.php?PageCour=1&Liste=Oui&Autonome=0&NoOffre=2037501&RefOffrel=&NoFaml=0&NoParam1l=0&NoParam2l=0&NoParam3l=0&NoParam133l=0&NoParam134l=0&NoParam136l=0&NoEntite1=0&NoEntite=&NoPaysl=0&NoRegionl=0&NoDepartementl=0&NoTableOffreLieePl=0&NoTableOffreLieeFl=0&NoNivEtl=0&NoTableCCl=0&NoTableCC2l=0&NoTableCC3l=0&NoTableOffreUnl=0&NoTypContratl=0&NoTypContratProl=0&NoStatutOffrel=&NoUtilisateurl=&RechPleinTextel=#ancre3"
oNav.navigate MyURL
'link provided to download the job offer in PDF. when clicked the PDF opens in a new tab
MyURL = "https://www.altays-progiciels.com/clicnjob/ExportPDFFront.php"
DownloadFile MyURL, "C:\[...Path...]\test.pdf"
End Sub
Solution 1:
Shadow DOM and invalid link generation:
The initial job page automated clicking on the target href doesn't generate a viable page link. This is presumably because the important stuff actually happens server side.
Target href:
You can click the actual download button on this page
Download button:
This launches a new window which is why Selenium is great. Selenium has methods to switch to this new Window. Otherwise, you can use the FindWindow methods I detail later in the answer for finding the Save As
window.
In this new window you cannot interact with the buttons in the way you can normally when scraping as the required content is not available via the DOM. If you examine closely you will see the pdf button is in shadow-root
i.e. where you cannot access. This is a design choice. I do need to investigate this possibility (selecting through the shadow DOM using '/deep/' combinator) at some point but I don't think it holds true in VBA.
Download button in Shadow root:
Mimicking keyboard actions:
I am using selenium basic VBA wrapper and APIs to mimic the actions on screen to save as pdf using the Save As
Window (see image at very bottom) . Particularly making use of Save
keyboard shortcut via SendKeys
. This works.
I used Spy++
to check the Window tree structure and check Window Class
names and Titles
.
I use SendKeys
to automate the opening of the Save As
dialog for the pdf. I then descend the Window tree structure to get handles on the ComboBox where the file name is entered, so I can send a message i.e. file name to it, and on the Save
button so I can click it. You may need a longer wait to ensure download goes through correctly. This bit is a little buggy in my opinion and I hope to improve.
Window Structure via Spy++
It is fairly robust. I used Selenium Basic for the ease of working with iframes and getting round same origin policy problems. With IE you cannot simply grab the src link of the iframe and happily navigate onto the page for the pdf print from the original add. What you can do, I believe, is issue an initial XMLHTTP request and grab the src
attribute value i.e. link. Then pass that src
link to IE and then carry on as shown below for the Windows handling parts.
With more time I could add the IE version in and will look at a more robust method, than explicit wait time adding, for monitoring for file download before quitting the IE instance. Likely along the lines of this (As stated in one of the answers: Use SetWindowsHookEx
to set up a WH_SHELL
hook and look for the HSHELL_WINDOWCREATED
event.)
Notes:
- This is written for 64 bit. 32 Bit remove
PtrSafe
. You could switchLongPtr
forLong
but I think it remains compatible. - Huge thanks to @ErikvonAsmuth for his enormous patience in going through the APIs with me. Take a look at his excellent answer here for working with Windows.
VBA:
Option Explicit
Declare PtrSafe Function SendMessageW Lib "User32" (ByVal hWnd As LongPtr, ByVal wMsg As LongPtr, ByVal wParam As LongPtr, ByVal lParam As LongPtr) As LongPtr
Declare PtrSafe Function FindWindowExW Lib "User32" (ByVal hWndParent As LongPtr, _
Optional ByVal hwndChildAfter As LongPtr, Optional ByVal lpszClass As LongPtr, _
Optional ByVal lpszWindow As LongPtr) As LongPtr
Public Declare PtrSafe Function FindWindowW Lib "User32" (ByVal lpClassName As LongPtr, Optional ByVal lpWindowName As LongPtr) As LongPtr
Public Const WM_SETTEXT = &HC
Public Const BM_CLICK = &HF5
Public Sub GetInfo()
Dim d As WebDriver, keys As New Selenium.keys
Const MAX_WAIT_SEC As Long = 5
Dim t As Date
Set d = New ChromeDriver
Const URL = "https://www.recrutement.banque-france.fr/detail-offre/charge-de-recrutement-confirme-h-f-2037343/"
With d
.start "Chrome"
.get URL
.SwitchToFrame .FindElementById("altiframe")
.FindElementById("btn-pdf").Click
.SwitchToNextWindow
.SendKeys keys.Control, "s"
Dim str1 As String, cls As String, name As String
Dim ptrSaveWindow As LongPtr
str1 = "#32770" & vbNullChar
t = Timer
Do
DoEvents
ptrSaveWindow = FindWindowW(StrPtr(str1))
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop While ptrSaveWindow = 0
Dim duiViewWND As LongPtr, directUIHWND As LongPtr
Dim floatNotifySinkHWND As LongPtr, comboBoxHWND As LongPtr, editHWND As LongPtr
If Not ptrSaveWindow > 0 Then Exit Sub
duiViewWND = FindWindowExW(ptrSaveWindow, 0&)
If Not duiViewWND > 0 Then Exit Sub
directUIHWND = FindWindowExW(duiViewWND, 0&)
If Not directUIHWND > 0 Then Exit Sub
floatNotifySinkHWND = FindWindowExW(directUIHWND, 0&)
If Not floatNotifySinkHWND > 0 Then Exit Sub
comboBoxHWND = FindWindowExW(floatNotifySinkHWND, 0&)
If Not comboBoxHWND > 0 Then Exit Sub
editHWND = FindWindowExW(comboBoxHWND, 0&)
If Not editHWND > 0 Then Exit Sub
Dim msg As String
msg = "myTest.pdf" & vbNullChar
SendMessageW editHWND, WM_SETTEXT, 0, StrPtr(msg)
.SendKeys keys.Control, "s"
Dim ptrSaveButton As LongPtr
cls = "Button" & vbNullChar
name = "&Save" & vbNullChar
ptrSaveButton = FindWindowExW(ptrSaveWindow, 0, StrPtr(cls), StrPtr(name))
SendMessageW ptrSaveButton, BM_CLICK, 0, 0
Application.Wait Now + TimeSerial(0, 0, 4)
.Quit
End With
End Sub
Save As Dialog Window:
References:
- Shadow DOM
- Using shadow DOM - Developer Mozilla pages.
- Accessing shadow-root when marked open - selenium
Project references:
- Selenium Type Library
`