How can I scrape specific data from a website

I'm trying to scrape data from a website for research.

The urls are nicely organized in an example.com/x format, with x as an ascending number and all of the pages are structured in the same way. I just need to grab certain headings and a few numbers which are always in the same locations. I'll then need to get this data into structured form for analysis in Excel.

I have used wget before to download pages, but I can't figure out how to grab specific lines of text.

Excel has a feature to grab data from the web (Data->From Web) but from what I can see it only allows me to download tables. Unfortunately, the data I need is not in tables.


Solution 1:

You can use iMacros to automate this task. It is a tool to automate repetative tasks, it can be used to extract data from a site and save it as a CSV file.

As x in example.com/x follows a pattern (i.e. ascending), you can easily program (even without programming exeprience) it to visit each page and extract data from it.

Read more at iMacros Tutorial.

Alternatively you can also try ScraperWiki (some programming experience required).

Solution 2:

You can paste this code in a module:

Option Explicit
Sub get_data()
Dim result As String
Dim myURL As String
Dim winHttpReq As Object
Set winHttpReq = CreateObject("WinHttp.WinHttpRequest.5.1")
myURL = "example.com/x format"

winHttpReq.Open "GET", myURL, False
winHttpReq.Send
result = winHttpReq.responseText
Application.ScreenUpdating = True

Range("Sheet1!A1").value = result
End sub

You can simply parse the long string in cell A1 to extract the data you need.