jsoup

Extracting the title is not difficult, and you have many options, search here on Stack Overflow for "Java HTML parsers". One of them is Jsoup.

You can navigate the page using DOM if you know the page structure, see http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library and I've used it in my last projects.


Your best bet is to use Selenium Web Driver since it

  1. Provides visual feedback to the coder (see your scraping in action, see where it stops)

  2. Accurate and Consistent as it directly controls the browser you use.

  3. Slow. Doesn't hit web pages like HtmlUnit does but sometimes you don't want to hit too fast.

    Htmlunit is fast but is horrible at handling Javascript and AJAX.


HTMLUnit can be used to do web scraping, it supports invoking pages, filling & submitting forms. I have used this in my project. It is good java library for web scraping. read here for more


mechanize for Java would be a good fit for this, and as Wadjy Essam mentioned it uses JSoup for the HMLT. mechanize is a stageful HTTP/HTML client that supports navigation, form submissions, and page scraping.

http://gistlabs.com/software/mechanize-for-java/ (and the GitHub here https://github.com/GistLabs/mechanize)


There is also Jaunt Java Web Scraping & JSON Querying - http://jaunt-api.com