How to get a list of files/directories for a directory URL?

Let's say I have a URL: http://java.sun.com/j2se/1.5/pdf. I want to get a list of all files/directories under the pdf directory.

I'm using Java 5.

I can get the directory listing with this program, http://www.httrack.com/, but I don't know whether it is possible in Java.

Does anybody know how to do this in Java? Or, if Java can't, how does that program do the job?


There are some conditions:

  1. The server must have directory listing enabled in order for you to see the directory's contents.
  2. As far as I know, there is no API or HTTP verb that retrieves the listing, so it is generally served as a normal HTML page.
  3. You will have to parse this HTML page to find the entries.

The parsing can be done easily with a library like JSoup.

For example, using JSoup you can fetch the listing at the URL http://howto.unixdev.net/ and print its entries like this:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

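public class Sample {
    public static void main(String[] args) throws IOException {
        // Fetch the listing page and parse it into a DOM
        Document doc = Jsoup.connect("http://howto.unixdev.net").get();
        // This selector matches the markup of this particular site;
        // other servers lay out their listings differently
        for (Element file : doc.select("td.right td a")) {
            // The href attribute holds the file or directory name
            System.out.println(file.attr("href"));
        }
    }
}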

This will output:

beignets.html
beignets.pdf
bsd-pam-ldap.html
ddns-updates.html
Debian_on_HP_dv6z.html
dextop-slackware.html
dirlist.html
downloads/
ldif/
Linux-SharePoint.html
rhfc3-apt.html
rhfc3-apt.tar.bz2
SUNWdsee-Debian.html
SUNWdtdte-b69.html
SUNWdtdte-b69.tar.bz2
tcshrc.html
Test_LVM_Trim_Ext4.html
Tru64-CS20-HOWTO.html

As for your sample URL, http://java.sun.com/j2se/1.5/pdf returns a "page not found" error, so I think you're out of luck there.


If the URL uses the file: protocol, then you can convert it to a java.io.File and use that class's methods, such as list() or listFiles(), to list the directory.
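A minimal sketch of the file: case, assuming the URL points at a readable local directory. The class name FileUrlListing and the helper list(URL) are hypothetical names chosen for illustration; the calls themselves (URL.toURI(), new File(URI), File.list()) are standard and available in Java 5.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FileUrlListing {
    // Returns the sorted entry names of the directory a file: URL points to.
    static List<String> list(URL url) throws URISyntaxException {
        if (!"file".equals(url.getProtocol())) {
            throw new IllegalArgumentException("Not a file: URL: " + url);
        }
        File dir = new File(url.toURI());
        // File.list() returns null if the path is not a directory or cannot be read
        String[] names = dir.list();
        if (names == null) {
            throw new IllegalArgumentException("Not a readable directory: " + dir);
        }
        Arrays.sort(names);
        return new ArrayList<String>(Arrays.asList(names));
    }

    public static void main(String[] args) throws Exception {
        // Lists the directory given as the first argument, or the current one
        URL url = new File(args.length > 0 ? args[0] : ".").toURI().toURL();
        for (String name : list(url)) {
            System.out.println(name);
        }
    }
}
```

Going through URL.toURI() rather than url.getPath() avoids problems with escaped characters such as %20 in the path.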

If the URL is for the http: protocol, then there is no concept of directories of files, and you fundamentally cannot do what you think you want to do. You will have to step back and look at the higher-level requirement you are trying to fulfill.

Have your server deploy a Servlet that retrieves the list of files in the folder named by the request it receives. On the client side, your application sends a request to the server with the path (virtual or relative) that you want to list. The servlet reads the list of files under the requested path from the server's file system, serializes it, and returns it to the client for further processing.
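The server-side idea can be sketched roughly as follows. To keep the example self-contained it uses the JDK's built-in com.sun.net.httpserver.HttpServer (Java 6 and later) instead of the Servlet API, so it is a stand-in for a real servlet, not the servlet itself. The class name ListingServer, the /list context path, port 8080, and the one-name-per-line response format are all assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;

import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.Arrays;

class ListingServer {
    // Returns the sorted entry names under root/relativePath, one per line,
    // with a trailing "/" on directories. Returns null if the path escapes
    // the root or is not a readable directory.
    static String listing(File root, String relativePath) throws IOException {
        File dir = new File(root, relativePath);
        String rootPath = root.getCanonicalPath();
        String dirPath = dir.getCanonicalPath();
        boolean inside = dirPath.equals(rootPath)
                || dirPath.startsWith(rootPath + File.separator);
        File[] children = inside ? dir.listFiles() : null;
        if (children == null) {
            return null;
        }
        Arrays.sort(children);
        StringBuilder out = new StringBuilder();
        for (File child : children) {
            out.append(child.getName());
            if (child.isDirectory()) {
                out.append('/');
            }
            out.append('\n');
        }
        return out.toString();
    }

    // Serves GET /list/<path> with the listing of <path> relative to root.
    static HttpServer start(final File root, int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/list", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                // Everything after "/list" names the directory to list
                String path = exchange.getRequestURI().getPath().substring("/list".length());
                String body = listing(root, path);
                byte[] bytes = (body == null ? "not found\n" : body).getBytes("UTF-8");
                exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=UTF-8");
                exchange.sendResponseHeaders(body == null ? 404 : 200, bytes.length);
                OutputStream os = exchange.getResponseBody();
                os.write(bytes);
                os.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Exposes the current working directory on port 8080
        start(new File("."), 8080);
    }
}
```

With this running, a client would request a URL such as http://localhost:8080/list/docs and split the response body on newlines. The canonical-path check is there so that a request like /list/../.. cannot list directories outside the exposed root.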

If you only have HTTP access to the page, then:
Fetch the HTML page that shows the directory listing and parse it, for example with regular expressions, to extract the file names.
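A rough sketch of the regular-expression approach, assuming the listing page has already been downloaded into a String and that entries appear as href attributes on anchor tags. The class name ListingLinks, the helper extractLinks, and the sample HTML are made up for illustration. The filtering rules (skipping links that start with "?", "/" or "../") are an assumption based on how Apache-style index pages typically look, and may need adjusting for other servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingLinks {
    // Matches href="..." or href='...', case-insensitively.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the link targets that look like directory entries.
    static List<String> extractLinks(String html) {
        List<String> entries = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2);
            // Skip empty links, column-sort links ("?C=N;O=D"),
            // and links that lead out of the directory ("/", "../")
            if (href.length() == 0 || href.startsWith("?")
                    || href.startsWith("/") || href.startsWith("../")) {
                continue;
            }
            entries.add(href);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical fragment of an Apache-style index page
        String html = "<a href=\"?C=N;O=D\">Name</a>"
                + "<a href=\"/\">Parent Directory</a>"
                + "<a href=\"docs/\">docs/</a>"
                + "<a href=\"readme.txt\">readme.txt</a>";
        System.out.println(extractLinks(html)); // prints [docs/, readme.txt]
    }
}
```

Regular expressions are fragile against arbitrary HTML, so this only suits simple, predictable listing pages; an HTML parser is the safer choice when the markup varies.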