Using Java to pull data from a webpage?
I'm attempting to make my first program in Java. The goal is to write a program that browses to a website and downloads a file for me. However, I don't know how to use Java to interact with the internet. Can anyone tell me what topics to look up/read about or recommend some good resources?
Solution 1:
The simplest solution (without depending on any third-party library or platform) is to create a URL instance pointing to the web page / link you want to download, and read the content using streams.
For example:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class DownloadPage {

    public static void main(String[] args) throws IOException {

        // Make a URL to the web page
        URL url = new URL("http://stackoverflow.com/questions/6159118/using-java-to-pull-data-from-a-webpage");

        // Get the input stream through the URL connection
        URLConnection con = url.openConnection();
        InputStream is = con.getInputStream();

        // Once you have the input stream, it's just plain old Java IO.
        // Since you are interested in a plain-text web page here, I'll use a
        // reader and print the text content to System.out.
        // For binary content, it's better to read the bytes from the stream
        // directly and write them to the target file (see the sketch below).
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
            String line;
            // Read each line and write it to System.out
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
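Since your actual goal is to download a file, here is a minimal sketch of the binary case. The URL and target path are placeholders you'd replace with your own; java.nio.file.Files.copy (Java 7+) does the stream-to-file copying for you:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class DownloadFile {

    public static void main(String[] args) throws IOException {
        // Placeholder file URL and target path -- replace with your own
        URL url = new URL("http://example.com/some-file.zip");
        Path target = Paths.get("some-file.zip");

        // Read raw bytes from the connection and write them straight to disk
        try (InputStream is = url.openConnection().getInputStream()) {
            Files.copy(is, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}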
Hope this helps.
Solution 2:
The Basics
Look at these to build a solution more or less from scratch:
- Start from the basics: The Java Tutorial's chapter on Networking, including Working With URLs
- Make things easier for yourself: Apache HttpComponents (including HttpClient); a quick sketch follows below
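For instance, a minimal HttpClient sketch (assuming the 4.x API; the URL is just a placeholder) could look like this:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientExample {

    public static void main(String[] args) throws Exception {
        // Create a default client and issue a GET request (placeholder URL)
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://example.com/");
            try (CloseableHttpResponse response = client.execute(get)) {
                // Read the response body as a String and print it
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        }
    }
}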
The Easily Glued-Up and Stitched-Up Stuff
You always have the option of calling external tools from Java using exec() and similar methods. For instance, you could use wget or cURL.
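As an example, here is a minimal sketch that shells out to wget via ProcessBuilder (it assumes wget is installed and on your PATH; the URL is a placeholder):

import java.io.IOException;

public class WgetExample {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumes wget is on the PATH; the URL is a placeholder
        ProcessBuilder pb = new ProcessBuilder("wget", "http://example.com/some-file.zip");
        pb.inheritIO(); // show wget's output in this program's console
        Process process = pb.start();
        int exitCode = process.waitFor();
        System.out.println("wget exited with code " + exitCode);
    }
}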
The Hardcore Stuff
Then, if you want to go into more fully-fledged stuff, the need for automated web testing has thankfully given us very practical tools for this. Look at:
- HtmlUnit (powerful and simple; see the sketch after this list)
- Selenium, Selenium-RC
- WebDriver/Selenium2 (still in the works)
- JBehave with JBehave Web
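As an illustration, here is a minimal HtmlUnit sketch (assuming a recent 2.x version, where WebClient is AutoCloseable; the URL is a placeholder):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitExample {

    public static void main(String[] args) throws Exception {
        // WebClient simulates a browser; closing it releases its resources
        try (WebClient webClient = new WebClient()) {
            // Load the page (placeholder URL) and print its title
            HtmlPage page = webClient.getPage("http://example.com/");
            System.out.println(page.getTitleText());
        }
    }
}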
Some other libs are purposefully written with web-scraping in mind:
- JSoup (see the sketch after this list)
- Jaunt
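For instance, a minimal JSoup sketch that fetches a page and lists its links (placeholder URL; the selector is just an example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {

    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (placeholder URL)
        Document doc = Jsoup.connect("http://example.com/").get();

        // Print the text and absolute href of every anchor on the page
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}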
Some Workarounds
Java is a language, but it is also a platform, with many other languages running on it, some of which integrate great syntactic sugar or libraries that make it easy to build scrapers.
Check out:
- Groovy (and its XmlSlurper)
- or Scala (with great XML support as presented here and here)
If you know of a great library for Ruby (JRuby, with an article on scraping with JRuby and HtmlUnit) or Python (Jython), or you simply prefer those languages, then give their JVM ports a chance.
Some Supplements
Some other similar questions:
- Scrape data from HTML using Java
- Options for HTML Scraping