Use getElementById on HTMLElement instead of HTMLDocument
I've been playing around with scraping data from web pages using VBS/VBA.
If it were Javascript I'd be away as its easy, but it doesn't seem to be quite as straight forward in VBS/VBA.
This is an example I made for an answer, it works but I had planned on accessing the child nodes using getElementByTagName
but I could not figure out how to use them! The HTMLElement
object does not have those methods.
Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As HTMLDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set Browser = New InternetExplorer
Browser.navigate "http://www.hsbc.com/about-hsbc/leadership"
Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Set Document = Browser.Document
Set Elements = Document.getElementsByClassName("profile-col1")
For Each Element in Elements
Debug.Print "[ name] " & Trim(Element.Children(1).Children(0).innerText)
Debug.Print "[ title] " & Trim(Element.Children(1).Children(1).innerText)
Next Element
Set Document = Nothing
Set Browser = Nothing
End Sub
I have been looking at the HTMLElement.document
property, seeing if it is like a fragment of the document but its either difficult to work with or just isnt what I think
Dim Fragment As HTMLDocument
Set Element = Document.getElementById("example") ' This works
Set Fragment = Element.document ' This doesn't
This also seems a long winded way to do it (although thats usually the way for vba imo). Anyone know if there is a simpler way to chain functions?
Document.getElementById("target").getElementsByTagName("tr")
would be awesome...
Solution 1:
Sub Scrape()
Dim Browser As InternetExplorer
Dim Document As htmlDocument
Dim Elements As IHTMLElementCollection
Dim Element As IHTMLElement
Set Browser = New InternetExplorer
Browser.Visible = True
Browser.navigate "http://www.stackoverflow.com"
Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Set Document = Browser.Document
Set Elements = Document.getElementById("hmenus").getElementsByTagName("li")
For Each Element In Elements
Debug.Print Element.innerText
'Questions
'Tags
'Users
'Badges
'Unanswered
'Ask Question
Next Element
Set Document = Nothing
Set Browser = Nothing
End Sub
Solution 2:
I don't like it either.
So use javascript:
Public Function GetJavaScriptResult(doc as HTMLDocument, jsString As String) As String
Dim el As IHTMLElement
Dim nd As HTMLDOMTextNode
Set el = doc.createElement("INPUT")
Do
el.ID = GenerateRandomAlphaString(100)
Loop Until Document.getElementById(el.ID) Is Nothing
el.Style.display = "none"
Set nd = Document.appendChild(el)
doc.parentWindow.ExecScript "document.getElementById('" & el.ID & "').value = " & jsString
GetJavaScriptResult = Document.getElementById(el.ID).Value
Document.removeChild nd
End Function
Function GenerateRandomAlphaString(Length As Long) As String
Dim i As Long
Dim Result As String
Randomize Timer
For i = 1 To Length
Result = Result & Chr(Int(Rnd(Timer) * 26 + 65 + Round(Rnd(Timer)) * 32))
Next i
GenerateRandomAlphaString = Result
End Function
Let me know if you have any problems with this; I've changed the context from a method to a function.
By the way, what version of IE are you using? I suspect you're on < IE8. If you upgrade to IE8 I presume it'll update shdocvw.dll to ieframe.dll and you will be able to use document.querySelector/All.
Edit
Comment response which isn't really a comment: Basically the way to do this in VBA is to traverse the child nodes. The problem is you don't get the correct return types. You could fix this by making your own classes that (separately) implement IHTMLElement and IHTMLElementCollection; but that's WAY too much of a pain for me to do it without getting paid :). If you're determined, go and read up on the Implements keyword for VB6/VBA.
Public Function getSubElementsByTagName(el As IHTMLElement, tagname As String) As Collection
Dim descendants As New Collection
Dim results As New Collection
Dim i As Long
getDescendants el, descendants
For i = 1 To descendants.Count
If descendants(i).tagname = tagname Then
results.Add descendants(i)
End If
Next i
getSubElementsByTagName = results
End Function
Public Function getDescendants(nd As IHTMLElement, ByRef descendants As Collection)
Dim i As Long
descendants.Add nd
For i = 1 To nd.Children.Length
getDescendants nd.Children.Item(i), descendants
Next i
End Function
Solution 3:
I would use XMLHTTP request to retrieve page content as much faster. Then it is easy enough to use querySelectorAll to apply a CSS class selector to grab by class name. Then you access the child elements by tag name and index.
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument, elements As Object, i As Long
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.hsbc.com/about-hsbc/leadership", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Set html = New HTMLDocument
With html
.body.innerHTML = sResponse
Set elements = .querySelectorAll(".profile-col1")
For i = 0 To elements.Length - 1
Debug.Print String(20, Chr$(61))
Debug.Print elements.item(i).getElementsByTagName("a")(0).innerText
Debug.Print elements.item(i).getElementsByTagName("p")(0).innerText
Debug.Print elements.item(i).getElementsByTagName("p")(1).innerText
Next
End With
End Sub
References:
VBE > Tools > References > Microsoft HTML Object Library