I'm trying to download PDFs from a website (job posting) thanks to a IE automation in VBA. I don't manage to generate a single PDF.

Doing it manually by going on the web page and doing a 'save target as' on the pdf icon gives me a valid PDF.

The code I have so far (the URLs are public and I've picked up offers at random).

Private Declare Function DownloadFilefromURL Lib "urlmon" _
Alias "URLDownloadToFileA" _
(ByVal pCaller As Long, _
ByVal szURL As String, _
ByVal szFileName As String, _
ByVal dwReserved As Long, _
ByVal lpfnCB As Long) As Long

Private Const ERROR_SUCCESS As Long = 0
Private Const BINDF_GETNEWESTVERSION As Long = &H10


Public Function DownloadFile(SourceUrl As String, LocalFile As String) As Boolean
    DownloadFile = DownloadFilefromURL(0&, SourceUrl, LocalFile, BINDF_GETNEWESTVERSION, 0&) = ERROR_SUCCESS
End Function


Sub TestSavePDF()
    Dim oNav As SHDocVw.InternetExplorer
    Dim oDoc As MSHTML.HTMLDocument
    Dim MyURL As String

    Set oNav = New SHDocVw.InternetExplorer
    oNav.Visible = True
    'Test Altays Client A (Banque de France)
    MyURL = "https://www.recrutement.banque-france.fr/detail-offre/?NoSource=16001&NoSociete=167&NoOffre=2036788&NoLangue=1"
    'Test Altays Client B (Egis)
    '        MyURL = "https://www.altays-progiciels.com/clicnjob/FicheOffreCand.php?PageCour=1&Liste=Oui&Autonome=0&NoOffre=2037501&RefOffrel=&NoFaml=0&NoParam1l=0&NoParam2l=0&NoParam3l=0&NoParam133l=0&NoParam134l=0&NoParam136l=0&NoEntite1=0&NoEntite=&NoPaysl=0&NoRegionl=0&NoDepartementl=0&NoTableOffreLieePl=0&NoTableOffreLieeFl=0&NoNivEtl=0&NoTableCCl=0&NoTableCC2l=0&NoTableCC3l=0&NoTableOffreUnl=0&NoTypContratl=0&NoTypContratProl=0&NoStatutOffrel=&NoUtilisateurl=&RechPleinTextel=#ancre3"

    
    oNav.navigate MyURL
    'link provided to download the job offer in PDF. when clicked the PDF opens in a new tab
    MyURL = "https://www.altays-progiciels.com/clicnjob/ExportPDFFront.php"

    DownloadFile MyURL, "C:\[...Path...]\test.pdf"
    
End Sub

Solution 1:

Shadow DOM and invalid link generation:

The initial job page automated clicking on the target href doesn't generate a viable page link. This is presumably because the important stuff actually happens server side.

Target href:

enter image description here

You can click the actual download button on this page

Download button:

enter image description here

This launches a new window which is why Selenium is great. Selenium has methods to switch to this new Window. Otherwise, you can use the FindWindow methods I detail later in the answer for finding the Save As window.

In this new window you cannot interact with the buttons in the way you can normally when scraping as the required content is not available via the DOM. If you examine closely you will see the pdf button is in shadow-root i.e. where you cannot access. This is a design choice. I do need to investigate this possibility (selecting through the shadow DOM using '/deep/' combinator) at some point but I don't think it holds true in VBA.

Download button in Shadow root:

enter image description here


Mimicking keyboard actions:

I am using selenium basic VBA wrapper and APIs to mimic the actions on screen to save as pdf using the Save As Window (see image at very bottom) . Particularly making use of Save keyboard shortcut via SendKeys. This works. I used Spy++ to check the Window tree structure and check Window Class names and Titles.

I use SendKeys to automate the opening of the Save As dialog for the pdf. I then descend the Window tree structure to get handles on the ComboBox where the file name is entered, so I can send a message i.e. file name to it, and on the Save button so I can click it. You may need a longer wait to ensure download goes through correctly. This bit is a little buggy in my opinion and I hope to improve.

Window Structure via Spy++

It is fairly robust. I used Selenium Basic for the ease of working with iframes and getting round same origin policy problems. With IE you cannot simply grab the src link of the iframe and happily navigate onto the page for the pdf print from the original add. What you can do, I believe, is issue an initial XMLHTTP request and grab the src attribute value i.e. link. Then pass that src link to IE and then carry on as shown below for the Windows handling parts.

With more time I could add the IE version in and will look at a more robust method, than explicit wait time adding, for monitoring for file download before quitting the IE instance. Likely along the lines of this (As stated in one of the answers: Use SetWindowsHookEx to set up a WH_SHELL hook and look for the HSHELL_WINDOWCREATED event.)


Notes:

  1. This is written for 64 bit. 32 Bit remove PtrSafe. You could switch LongPtr for Longbut I think it remains compatible.
  2. Huge thanks to @ErikvonAsmuth for his enormous patience in going through the APIs with me. Take a look at his excellent answer here for working with Windows.

VBA:

Option Explicit

Declare PtrSafe Function SendMessageW Lib "User32" (ByVal hWnd As LongPtr, ByVal wMsg As LongPtr, ByVal wParam As LongPtr, ByVal lParam As LongPtr) As LongPtr
 
Declare PtrSafe Function FindWindowExW Lib "User32" (ByVal hWndParent As LongPtr, _
                                                     Optional ByVal hwndChildAfter As LongPtr, Optional ByVal lpszClass As LongPtr, _
                                                     Optional ByVal lpszWindow As LongPtr) As LongPtr
                                                 
Public Declare PtrSafe Function FindWindowW Lib "User32" (ByVal lpClassName As LongPtr, Optional ByVal lpWindowName As LongPtr) As LongPtr

Public Const WM_SETTEXT = &HC
Public Const BM_CLICK = &HF5

Public Sub GetInfo()
    Dim d As WebDriver, keys As New Selenium.keys
    Const MAX_WAIT_SEC As Long = 5
    Dim t As Date
    
    Set d = New ChromeDriver
    Const URL = "https://www.recrutement.banque-france.fr/detail-offre/charge-de-recrutement-confirme-h-f-2037343/"
    With d
        .start "Chrome"
        .get URL
        .SwitchToFrame .FindElementById("altiframe")
        .FindElementById("btn-pdf").Click
        .SwitchToNextWindow
        .SendKeys keys.Control, "s"
        
        Dim str1 As String, cls As String, name As String
        Dim ptrSaveWindow As LongPtr
        
        str1 = "#32770" & vbNullChar
        
        t = Timer
        Do
            DoEvents
            ptrSaveWindow = FindWindowW(StrPtr(str1))
            If Timer - t > MAX_WAIT_SEC Then Exit Do
        Loop While ptrSaveWindow = 0
             
        Dim duiViewWND As LongPtr, directUIHWND As LongPtr
        Dim floatNotifySinkHWND As LongPtr, comboBoxHWND As LongPtr, editHWND As LongPtr


        If Not ptrSaveWindow > 0 Then Exit Sub
        
        duiViewWND = FindWindowExW(ptrSaveWindow, 0&)
        
        If Not duiViewWND > 0 Then Exit Sub
        
        directUIHWND = FindWindowExW(duiViewWND, 0&)
        
        If Not directUIHWND > 0 Then Exit Sub
        
        floatNotifySinkHWND = FindWindowExW(directUIHWND, 0&)
        
        If Not floatNotifySinkHWND > 0 Then Exit Sub
        
        comboBoxHWND = FindWindowExW(floatNotifySinkHWND, 0&)

        If Not comboBoxHWND > 0 Then Exit Sub
        
        editHWND = FindWindowExW(comboBoxHWND, 0&)
        
        If Not editHWND > 0 Then Exit Sub
        
        Dim msg As String
        msg = "myTest.pdf" & vbNullChar
        
        SendMessageW editHWND, WM_SETTEXT, 0, StrPtr(msg)

        .SendKeys keys.Control, "s"
        
        Dim ptrSaveButton As LongPtr
        cls = "Button" & vbNullChar
        name = "&Save" & vbNullChar

        ptrSaveButton = FindWindowExW(ptrSaveWindow, 0, StrPtr(cls), StrPtr(name))
      
        SendMessageW ptrSaveButton, BM_CLICK, 0, 0
             
        Application.Wait Now + TimeSerial(0, 0, 4)
        
        .Quit
    End With
End Sub

Save As Dialog Window:


References:

  1. Shadow DOM
  2. Using shadow DOM - Developer Mozilla pages.
  3. Accessing shadow-root when marked open - selenium

Project references:

  1. Selenium Type Library

`