Crashing my disk using Java

I have a Java program that must do the following 3 things:

  1. Download a file from a website.
  2. Run the file through testA and testB (both in Java).
  3. Delete the file and save the test results on the disk.

That is done for roughly 1,000,000 different sites. It was supposed to be a fairly simple task, since I simply glued together parts from other programs: testA and testB have both already been run separately on dozens of millions of different pages without trouble, and the routine that downloads the pages has also been run on a million or so pages several times, never with any problems either. Everything runs on an Ubuntu 10.04 machine.

However, when doing those three things one after the other, whichever disk the files are being written to crashes. The first time I ran it on an external USB HD, which I had to manually disconnect and reconnect for it to resume operation (Linux wouldn't recognize it otherwise). Next time, on an internal HD, the whole system went down, and I had to manually restart it. The same happened when writing to a RAM disk.

The problem is that I can't really isolate it. It takes way too long for a crash to occur (about 50 hours, but it's quite random), so testing takes too long, and there are no system logs of the failure indicating where or how it happens. The machine or HD simply stops responding.

Except for the crashing, everything works fine. Files are created and deleted normally, threads don't die and execute properly, and both tests work alright. Changing the memory or the number of threads has no effect on lockup time. I already checked for sockets and the like not being closed, but beyond that I don't even know how to start testing; I had no idea crashing a system this catastrophically was even possible from Java.
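For anyone wanting to reproduce the checks, watching the open-descriptor count from inside the JVM can be done roughly like this (a sketch; the cast to UnixOperatingSystemMXBean is Sun-JVM-specific and only works on Unix):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdMonitor {
    // Print open vs. maximum file descriptors; a count that grows
    // without bound points at a handle leak.
    public static void logOpenFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("fds: " + unix.getOpenFileDescriptorCount()
                    + " / " + unix.getMaxFileDescriptorCount());
        }
    }
}

Calling that periodically from a monitoring thread would be enough to spot a leak over a run this long.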

EDIT: By hangup I mean that, when I run it on an external HD, the HD stops being recognized by Linux, and when I run it on the internal HD or a RAM disk, the computer won't respond to any I/O whatsoever: nothing gets written to disk, Cacti logs stop being recorded, etc. I can't connect using SSH, for instance.

An example of how the program runs:

List<String> pagesToDownload = getFromDataBase();
for (int i = 0; i < numThreads; i++) {
    launchTestThread();
}

And then, on each thread:

String pageName = getNextPageToDownload();
File downloadedFile = downloadPage(pageName);
TestAResults testAResults = runTestA(downloadedFile);
TestBResults testBResults = runTestB(downloadedFile);
writeToDatabase(downloadedFile, testAResults, testBResults);
downloadedFile.delete();

Individually, the functions runTestA, runTestB and downloadPage work for even larger numbers of files, but when called this way, they don't, and on the very same hardware.
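For completeness, a slightly fuller sketch of what each worker does, with the loop and cleanup spelled out (the while loop, null sentinel and logging are illustrative, not the exact code):

public void run() {
    String pageName;
    while ((pageName = getNextPageToDownload()) != null) {
        File downloadedFile = null;
        try {
            downloadedFile = downloadPage(pageName);
            TestAResults testAResults = runTestA(downloadedFile);
            TestBResults testBResults = runTestB(downloadedFile);
            writeToDatabase(downloadedFile, testAResults, testBResults);
        } finally {
            // File.delete() returns false instead of throwing, so the
            // result is checked rather than silently ignored
            if (downloadedFile != null && !downloadedFile.delete()) {
                System.err.println("could not delete " + downloadedFile);
            }
        }
    }
}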

EDIT2: I think I have ruled out the hardware as the problem. An equally hardware-intensive program has been running on the very same machine for the last 7 days without any problems. Anyway, as soon as I can get an unused machine I'm going to test the program on it.

Also, everything in the test is written to the database up to the point where the crash occurs, and the data is correct. The downloadedFile isn't passed as a parameter to the writeToDatabase method, just its name and size.

Finally, I did some extensive checking for memory and file handle leaks and turned up none, including inside the working tests. Right now, my money is on some strange bug in the file deletion.

EDIT3: I finally managed to get another machine on which to test the routine. Different hardware, but the same Ubuntu version (10.04 LTS). And it crashes there too, so I really doubt it is a hardware problem. That leaves either an OS bug, a JVM bug or a programming bug (there is no JNI or anything like it involved). I'm going to try running the test in some other environment (setting up a test on FreeBSD will be pretty easy, and I can try to find a Windows machine to test on) in order to verify that.

EDIT4: Answering Bob Cross's question about how big the files are: they are typical web pages, about 20 KB on average. I have to delete them, since the idea is to scale the application up, which would otherwise make the disk usage unbearable. But I'm going to try a deletion-free run as soon as I can. The machine where I was running these tests is in use right now, and I'm having a hard time getting hold of some idle hardware.


If the system stops responding, it is a bug in the operating system, or a hardware problem. A program should not be able to hang a system, no matter how buggy.

Run your program on a different system and see if it provides a helpful diagnostic for your program instead of just keeling over.


It's a little hard to see exactly what you're doing from the summary above, but I'm going to make a guess based on this paragraph:

However, when doing those three things one after the other, whichever disk the files are being written to crashes. The first time I ran it on an external USB HD, which I had to manually disconnect and reconnect for it to resume operation (Linux wouldn't recognize it otherwise). Next time, on an internal HD, the whole system went down, and I had to manually restart it. The same happened when writing to a RAM disk.

How are you managing your results file relative to the list of a million sites that you're checking? Specifically, does your code look like this (abstracted pseudocode follows):

  1. Open results file.
  2. Loop begins : For each of a million sites
  3. TestA on site
  4. TestB on site : Loop ends
  5. Write results of site tests to results file.

If so, I suspect the problem is that you're slowly accumulating results that haven't been written out to your results file; they're likely sitting in one of several caches, waiting for a chance to be flushed to disk.

If the above is what you're doing, try this instead:

  1. Loop begins : For each of a million sites
  2. TestA on site.
  3. TestB on site.
  4. Open results file.
  5. Write results of site tests to results file.
  6. Close results file : Loop ends

That should ensure that, after every test, you write a result to the file. At a minimum, you'll be able to observe your program's progress.
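In Java terms, the per-iteration version would look roughly like this (a sketch; appendResult and the tab-separated format are invented, and FileWriter's append mode handles the open-and-close per write):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Open, write and close the results file on every iteration, so each
// result is flushed to disk as soon as it is produced.
void appendResult(String site, String resultLine) throws IOException {
    PrintWriter out = new PrintWriter(new FileWriter("results.txt", true)); // true = append
    try {
        out.println(site + "\t" + resultLine);
    } finally {
        out.close(); // flushes and releases the handle every time
    }
}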

EDIT: following up on edit to question:

writeToDatabase(downloadedFile, testAResults, testBResults); 

Two questions:

  1. Are changes ever written to your database? As in, they write for a while and then stop? Or is the database completely empty?

  2. Is that really how the code is written? If so, you have a potential memory leak: that File reference is being passed into your writeToDatabase method. Are you sure that nothing is hanging onto that reference?

It's possible that you are holding on to too many file handles as a consequence of leaked File references and/or native file handles. It's worth checking.
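One defensive pattern, if the database really only needs the name and size, is to extract those primitives immediately so that nothing downstream can retain the File (a sketch; storeResults is an invented stand-in for the actual persistence call):

// Pass only primitives onward, so no code path can accidentally
// hold a reference to the File itself.
void writeToDatabase(File downloadedFile, TestAResults a, TestBResults b) {
    String name = downloadedFile.getName();
    long size = downloadedFile.length();
    storeResults(name, size, a, b); // hypothetical persistence call
}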

EDIT AGAIN based on more feedback:

Right now, my money is on some strange bug in the file deletion.

That's a possibility. Some things to think about:

  1. How big are the files that you are downloading? Could you try a debugging run without the deletes (see the sketch after this list)? If you run longer without that code, that would certainly be an interesting result.

  2. I'm still concerned that you are leaking file handles. If you check the number of files processed and find a suspicious number, such as a multiple of 1024, I would look more closely at whatever's in the delete method. It's a little hard to tell without the implementation, of course.
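If you do try the deletion-free run, something like this keeps the experiment to a one-flag change and makes failed deletes visible (DELETE_FILES and cleanUp are invented names):

static final boolean DELETE_FILES = false; // set to true for normal runs

static void cleanUp(File downloadedFile) {
    if (!DELETE_FILES) {
        return; // deletion-free debugging run
    }
    // delete() reports failure via its return value, not an exception
    if (!downloadedFile.delete()) {
        System.err.println("delete() failed for " + downloadedFile.getAbsolutePath());
    }
}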


Without any further information, I'd guess that you've filled your disk. Being unable to connect via SSH is a little strange, but possible if the system can't write to auth.log.
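A quick way to rule that out from Java (File.getUsableSpace() has existed since Java 6; the path here is a stand-in for the actual download directory):

import java.io.File;

public class DiskSpaceCheck {
    public static void main(String[] args) {
        // Report the usable space on the partition being written to
        File dir = new File("/tmp");
        System.out.println("Usable: "
                + dir.getUsableSpace() / (1024 * 1024) + " MB");
    }
}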