How can I safely encode a string in Java to use as a filename?

I'm receiving a string from an external process. I want to use that String to make a filename, and then write to that file. Here's my code snippet to do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), s);
    PrintWriter currentWriter = new PrintWriter(currentFile);

If s contains an invalid character, such as '/' in a Unix-based OS, then a java.io.FileNotFoundException is (rightly) thrown.

How can I safely encode the String so that it can be used as a filename?

Edit: What I'm hoping for is an API call that does this for me.

I can do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), URLEncoder.encode(s, "UTF-8"));
    PrintWriter currentWriter = new PrintWriter(currentFile);

But I'm not sure whether URLEncoder it is reliable for this purpose.


Solution 1:

My suggestion is to take a "white list" approach, meaning don't try and filter out bad characters. Instead define what is OK. You can either reject the filename or filter it. If you want to filter it:

String name = s.replaceAll("\\W+", "");

What this does is replaces any character that isn't a number, letter or underscore with nothing. Alternatively you could replace them with another character (like an underscore).

The problem is that if this is a shared directory then you don't want file name collision. Even if user storage areas are segregated by user you may end up with a colliding filename just by filtering out bad characters. The name a user put in is often useful if they ever want to download it too.

For this reason I tend to allow the user to enter what they want, store the filename based on a scheme of my own choosing (eg userId_fileId) and then store the user's filename in a database table. That way you can display it back to the user, store things how you want and you don't compromise security or wipe out other files.

You can also hash the file (eg MD5 hash) but then you can't list the files the user put in (not with a meaningful name anyway).

EDIT:Fixed regex for java

Solution 2:

It depends on whether the encoding should be reversible or not.

Reversible

Use URL encoding (java.net.URLEncoder) to replace special characters with %xx. Note that you take care of the special cases where the string equals ., equals .. or is empty!¹ Many programs use URL encoding to create file names, so this is a standard technique which everybody understands.

Irreversible

Use a hash (e.g. SHA-1) of the given string. Modern hash algorithms (not MD5) can be considered collision-free. In fact, you'll have a break-through in cryptography if you find a collision.


¹ You can handle all 3 special cases elegantly by using a prefix such as "myApp-". If you put the file directly into $HOME, you'll have to do that anyway to avoid conflicts with existing files such as ".bashrc".
public static String encodeFilename(String s)
{
    try
    {
        return "myApp-" + java.net.URLEncoder.encode(s, "UTF-8");
    }
    catch (java.io.UnsupportedEncodingException e)
    {
        throw new RuntimeException("UTF-8 is an unknown encoding!?");
    }
}

Solution 3:

Here's what I use:

public String sanitizeFilename(String inputName) {
    return inputName.replaceAll("[^a-zA-Z0-9-_\\.]", "_");
}

What this does is is replace every character which is not a letter, number, underscore or dot with an underscore, using regex.

This means that something like "How to convert £ to $" will become "How_to_convert___to__". Admittedly, this result is not very user-friendly, but it is safe and the resulting directory /file names are guaranteed to work everywhere. In my case, the result is not shown to the user, and is thus not a problem, but you may want to alter the regex to be more permissive.

Worth noting that another problem I encountered was that I would sometimes get identical names (since it's based on user input), so you should be aware of that, since you can't have multiple directories / files with the same name in a single directory. I just prepended the current time and date, and a short random string to avoid that. (an actual random string, not a hash of the filename, since identical filenames will result in identical hashes)

Also, you may need to truncate or otherwise shorten the resulting string, since it may exceed the 255 character limit some systems have.