Web scraping with jsoup in Kotlin
I am trying to scrape this website as part of my lesson to learn Kotlin and Web scraping with jsoup.
What I am trying to scrape is the Jackpot $1,000,000 est.
values.
The below code was something that I wrote after searching and checking out a couple of tutorials online, but it won't even give me $1,000,000
(which was what this code was trying to scrape).
Jsoup.connect("https://online.singaporepools.com/lottery/en/home")
.get()
.run {
select("div.slab__text slab__text--highlight").forEachIndexed { i, element ->
val titleAnchor = element.select("div")
val title = titleAnchor.text()
println("$i. $title")
}
}
My first thought is that maybe this website is using JavaScript. That's why it was not successful.
How should I be going about scraping it?
I was able to scrape what you were looking for from this page on that same site.
Even if it's not what you want, the procedure may help someone in the future.
Here is how I did that:
- First I opened that page
- Then I opened the Chrome developer tools by pressing CTRL+
SHIFT+i or
by right-clicking somewhere on page and selecting Inspect or
by clicking ⋮ ➜ More tools ➜ Developer tools - Next I selected the Network tab
- And finally I refreshed the page with F5 or with the refresh button ⟳
A list of requests start to appear (network log) and after, say, a few seconds, all requests will complete executing. Here, we want to look for and inspect a request that has a Type like xhr. We can filter requests by clicking the filter icon and then selecting the desired type.
To inspect a request, click on its name (first column from left):
Clicking on one of the XHR requests, and then selecting the Response tab shows that the response contains exactly what we are looking for. And it is HTML, so jsoup can parse it:
Here is that response (if you want to copy or manipulate it):
<div style='vertical-align:top;'>
<div>
<div style='float:left; width:120px; font-weight:bold;'>
Next Jackpot
</div>
<span style='color:#EC243D; font-weight:bold'>$8,000,000 est</span>
</div>
<div>
<div style='float:left; width:120px; font-weight:bold;'>
Next Draw
</div>
<div class='toto-draw-date'>Mon, 15 Nov 2021 , 9.30pm</div>
</div>
</div>
By selecting the Headers tab (to the left of the Response tab), we see the Request URL is https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m
and the Request Method is GET and agian the Content-Type is text/html.
So, with the URL and the HTTP method we found, here is the code to scrape that HTML:
val document = Jsoup
.connect("https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m")
.userAgent("Mozilla")
.get()
val targetElement = document
.body()
.children()
.single()
val phrase = targetElement.child(0).text()
val prize = targetElement.select("span").text().removeSuffix(" est")
println(phrase) // Next Jackpot $8,000,000 est
println(prize) // $8,000,000
Here is another solution for parsing a dynamic page with Selenium and jsoup.
We first get and store the page with Selenium and then parse it with jsoup.
Just make sure to download the browser driver and move its executable file to your classpath.
I downloaded the Chrome driver version 95 and placed it along my Kotlin .kts script.
System.setProperty("webdriver.chrome.driver", "chromedriver.exe")
val result = File("output.html")
// OR FirefoxDriver(); download its driver and set the appropriate system property above
val driver = ChromeDriver()
driver.get ("https://www.singaporepools.com.sg/en/product/sr/Pages/toto_results.aspx")
result.writeText(driver.pageSource)
driver.close()
val document = Jsoup.parse(result, "UTF-8")
val targetElement = document
.body()
.children()
.select(":containsOwn(Next Jackpot)")
.single()
.parent()!!
val phrase = targetElement.text()
val prize = targetElement.select("span").text().removeSuffix(" est")
println(phrase) // Next Jackpot $8,000,000 est
println(prize) // $8,000,000
Another version of code for getting the target element:
val targetElement = document
.body()
.selectFirst(":containsOwn(Next Jackpot)")
?.parent()!!
I only used the following dependencies:
- org.seleniumhq.selenium:selenium-java:4.0.0
- org.jsoup:jsoup:1.14.3
See the standalone script file. It can be executed with Kotlin runner from command line like this:
kotlin my-script.main.kts