When is it best to sanitize user input?

Users are untrustworthy, so never trust user input. I get that. However, I am wondering when the best time to sanitize input is. For example, do you blindly store user input and then sanitize it whenever it is accessed or used, or do you sanitize the input immediately and then store this "cleaned" version? Maybe there are other approaches I haven't thought of. I am leaning towards the first method, because any data that came from user input must still be approached cautiously, and the "cleaned" data might still be dangerous without anyone realizing it. Either way, which method do people think is best, and for what reasons?


Solution 1:

Unfortunately, almost none of the participants clearly understands what they are talking about. Literally. Only Kibbee got it right.

This topic is all about sanitization. But the truth is that the broad "general purpose sanitization" everyone is so eager to talk about simply doesn't exist.

There are a zillion different mediums, and each requires its own distinct data formatting. Moreover, even a single medium requires different formatting for its different parts. For example, HTML formatting is useless for JavaScript embedded in an HTML page, and string formatting is useless for numbers in an SQL query.
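
To illustrate the HTML versus JavaScript point, here is a minimal Python sketch. The helper names `for_html_text` and `for_inline_script` are made up for this example, and the sketch assumes the value ends up either as HTML text content or as a string literal inside a `<script>` block.

```python
import html
import json


def for_html_text(value: str) -> str:
    """Format a value for use as HTML text content."""
    return html.escape(value)


def for_inline_script(value: str) -> str:
    """Format a value as a JavaScript string literal inside a <script> block."""
    # json.dumps yields a valid JS string literal; "<" is escaped as well so
    # the value cannot close the surrounding <script> element.
    return json.dumps(value).replace("<", "\\u003c")


value = "first line\nsecond line </script>"

# HTML formatting leaves the newline alone, which would break a JS string literal.
html_formatted = for_html_text(value)

# JS formatting escapes the newline and the "<", but is not HTML text formatting.
js_formatted = for_inline_script(value)
```

The same input gets two different outputs, and neither one is correct in the other's place.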

As a matter of fact, "sanitization as early as possible", as suggested in the most upvoted answers, is simply impossible, because one cannot tell in advance which medium, or which part of a medium, the data will be used in. Say we are preparing to defend against SQL injection by escaping everything that moves. But whoops! Some required fields weren't filled in, and we have to put the data back into the form instead of the database... with all the slashes added.
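
A small Python sketch of that failure, assuming a hypothetical blanket escaper called `add_slashes` that is applied the moment the data arrives:

```python
import html


def add_slashes(value: str) -> str:
    """Hypothetical blanket 'sanitizer': backslash-escape backslashes and quotes."""
    return value.replace("\\", "\\\\").replace("'", "\\'").replace('"', '\\"')


submitted = "O'Brien"
escaped_early = add_slashes(submitted)  # done on arrival, "just in case"

# A required field was missing, so the form is shown again with the old value.
redisplayed_wrong = html.escape(escaped_early)  # the user now sees O\'Brien
redisplayed_right = html.escape(submitted)      # formatted for HTML only
```

The early escape was written with SQL in mind, but the value went to an HTML form instead, so the user gets a stray backslash back.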

On the other hand, suppose we diligently escaped all the "user input", but in the SQL query there are no quotes around it, because it is a number or an identifier. No "sanitization" helped us there.
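
As a rough Python sketch of the unquoted-number case: `escape_sql_string` and `as_sql_integer` are hypothetical helpers invented for this example, and the query is built by concatenation only to show the problem.

```python
def escape_sql_string(value: str) -> str:
    """Hypothetical string-literal escaping: doubles single quotes."""
    return value.replace("'", "''")


def as_sql_integer(value: str) -> str:
    """Formatting for a numeric position: accept nothing but an integer."""
    return str(int(value))  # int() raises ValueError for anything else


user_id = "1 OR 1=1"

# The input contains no quotes, so string escaping changes nothing,
# and the value lands in the query as live SQL.
unsafe_query = "SELECT * FROM users WHERE id = " + escape_sql_string(user_id)

try:
    as_sql_integer(user_id)
    rejected = False
except ValueError:
    rejected = True
```

String escaping is the right formatting for a quoted string literal and the wrong one for a number, which is why the second helper rejects the input outright.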

On the third hand: okay, we did our best in sanitizing the terrible, untrustworthy and disdained "user input", but in some inner process we used this very data without any formatting (we had done our best already!), and whoops! We have a second-order injection in all its glory.
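
One way to avoid the second-order case is to format the value again every time it goes into a query, even when it comes out of your own database. A minimal sketch with Python's `sqlite3` module; the table names are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("CREATE TABLE audit (name TEXT)")

# First use: the value arrives from the user and is bound as a parameter.
name = "x'); DROP TABLE users; --"
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

# Second use: the value comes back from our own table. It is no safer now
# than it was before, so it is bound again rather than pasted into the SQL.
(stored_name,) = conn.execute("SELECT name FROM users").fetchone()
conn.execute("INSERT INTO audit (name) VALUES (?)", (stored_name,))

copied = conn.execute("SELECT name FROM audit").fetchone()[0]
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```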

So, from the point of view of real-life usage, the only proper way would be

  • formatting, not some vague "sanitization"
  • right before use
  • according to the rules of the particular medium
  • and even following the sub-rules required for that medium's different parts.
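
The list above could look roughly like this in Python: one raw value, formatted separately for each destination at the moment of use. The three destinations chosen here (HTML text, a URL path segment, a shell argument) are only examples.

```python
import html
import shlex
from urllib.parse import quote

raw = "Tom & Jerry's <file>.txt"  # stored and passed around unmodified

# Each medium gets its own formatting, applied right before use.
as_html_text = html.escape(raw)
as_url_path_segment = quote(raw, safe="")
as_shell_argument = shlex.quote(raw)
```

None of the three results is usable in either of the other two places, which is the whole point.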

Solution 2:

I like to sanitize it as early as possible, which means the sanitizing happens when the user tries to enter invalid data. If there's a TextBox for their age and they type in anything other than a number, I don't let the keypress for the letter go through.

Then, in whatever is reading the data (often a server), I do a sanity check as the data is read in, just to make sure that nothing slips in from a more determined user (for example by hand-editing files, or even modifying packets!).

Edit: Overall, sanitize early and sanitize any time you've lost sight of the data for even a second (e.g. File Save -> File Open)
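
The server-side re-check of the age field might look something like this in Python. The function name `parse_age` and the 0 to 150 range are assumptions for the sake of the example.

```python
def parse_age(raw: str) -> int:
    """Re-check an 'age' field on the server, whatever the client UI enforced."""
    text = raw.strip()
    # Only plain ASCII digits are accepted; signs, letters and blanks are not.
    if not (text.isascii() and text.isdigit()):
        raise ValueError("age must be a whole number")
    age = int(text)
    if not 0 <= age <= 150:  # assumed plausible range
        raise ValueError("age out of range")
    return age
```

The same function can be called again after any point where the data was out of sight, such as after reading it back from a file.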

Solution 3:

I sanitize my user data much like Radu...

  1. First, client-side, using both regexes and control over the characters allowed into given form fields, with JavaScript or jQuery tied to events such as onChange or onBlur, which removes any disallowed input before it can even be submitted. Realize, however, that this really only lets those users in the know see that the data is going to be checked server-side as well. It's more a warning than any actual protection.

  2. Second, and I rarely see this done these days, the first check done server-side is on the location the form is being submitted from. By only allowing form submission from a page that you have designated as a valid location, you can kill the script BEFORE you have even read in any data. Granted, that in itself is insufficient, as a good hacker with their own server can 'spoof' both the domain and the IP address to make it appear to your script that it is coming from a valid form location.

  3. Next, and I shouldn't even have to say this, but always, and I mean ALWAYS, run your scripts in taint mode. This forces you to not get lazy, and to be diligent about step number 4.

  4. Sanitize the user data as soon as possible using well-formed regexes appropriate to the data that is expected from any given field on the form. Don't take shortcuts like the infamous 'magic horn of the unicorn' to blow through your taint checks, or you may as well just turn off taint checking in the first place for all the good it will do for your security. That's like giving a psychopath a sharp knife, baring your throat, and saying 'You really won't hurt me with that, will you?'

    And here is where I differ from most others in this fourth step: I only sanitize the user data that I am actually going to USE in a way that may present a security risk, such as any system calls, assignments to other variables, or any writing to stored data. If I am only using the data input by a user to make a comparison to data I have stored on the system myself (therefore knowing that my own data is safe), then I don't bother to sanitize the user data, as I am never going to use it in a way that presents a security problem. Take a username input as an example. I use the username input by the user only to check for a match in my database, and if there is one, I use the data from the database for all other functions I might call in the script, knowing it is safe, and never use the user's data again after that.

  5. Last is to filter out all the attempted auto-submits by robots with a 'human authentication' system, such as Captcha. This is important enough these days that I took the time to write my own 'human authentication' scheme that uses photos and an input for the 'human' to enter what they see in the picture. I did this because I've found that Captcha type systems really annoy users (you can tell by their squinted-up eyes from trying to decipher the distorted letters... usually over and over again). This is especially important for scripts that use either SendMail or SMTP for email, as these are favorites for your hungry spam-bots.
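
Step 4 above could be sketched like this in Python. Python has no built-in taint mode, so this only illustrates the per-field regex check; the field names and patterns are hypothetical.

```python
import re

# Hypothetical allow-list patterns, one per expected form field.
FIELD_PATTERNS = {
    "username": re.compile(r"[A-Za-z0-9_]{3,20}"),
    "zip_code": re.compile(r"[0-9]{5}"),
}


def validate_field(name: str, value: str) -> str:
    """Return the value only if the whole string matches the field's pattern."""
    if FIELD_PATTERNS[name].fullmatch(value) is None:
        raise ValueError(f"invalid value for {name}")
    return value
```

Using `fullmatch` means the entire value has to fit the pattern, so a trailing newline or extra characters tacked on to an otherwise valid value are rejected too.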

To wrap it up in a nutshell, I'll explain it as I do to my wife... your server is like a popular nightclub, and the more bouncers you have, the less trouble you are likely to have in the nightclub. I have two bouncers outside the door (client-side validation and human authentication), one bouncer right inside the door (checking for valid form submission location... 'Is that really you on this ID'), and several more bouncers in close proximity to the door (running taint mode and using good regexes to check the user data).

I know this is an older post, but I felt it important enough for anyone that may read it after my visit here to realize there is no 'magic bullet' when it comes to security, and it takes all these working in conjunction with one another to make your user-provided data secure. Just using one or two of these methods alone is practically worthless, as their power only exists when they all team together.

Or in summary, as my Mum would often say... 'Better safe than sorry'.

UPDATE:

One more thing I am doing these days is Base64 encoding all my data, and then encrypting the Base64 data that will reside in my SQL databases. It takes about a third more total bytes to store it this way, but the security benefits outweigh the extra size of the data in my opinion.
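
The "about a third more" figure is easy to check in Python. This sketch covers only the Base64 size arithmetic; the encryption step is left out because it depends on whichever cipher library is used.

```python
import base64
import os

payload = os.urandom(3000)           # arbitrary example data
encoded = base64.b64encode(payload)  # every 3 input bytes become 4 output bytes

overhead = len(encoded) / len(payload) - 1  # fraction of extra bytes
round_trip = base64.b64decode(encoded)      # Base64 itself is fully reversible
```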

Solution 4:

It depends on what kind of sanitizing you are doing.

For protecting against SQL injection, don't do anything to the data itself. Just use prepared statements, and that way you don't have to worry about messing with the data that the user entered and having it negatively affect your logic. You do have to sanitize a little bit, to ensure that numbers are numbers and dates are dates, since everything is a string as it comes from the request, but don't try to block keywords or anything like that.
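
A minimal sketch of that approach with Python's `sqlite3` module. The `orders` table, the form field names and the `parse_request` helper are assumptions made for the example.

```python
import sqlite3
from datetime import date


def parse_request(form: dict) -> tuple:
    """Turn request strings into typed values; the free text is left untouched."""
    name = form["name"]                             # stored as entered
    quantity = int(form["quantity"])                # numbers must be numbers
    ship_on = date.fromisoformat(form["ship_on"])   # dates must be dates
    return name, quantity, ship_on.isoformat()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, quantity INTEGER, ship_on TEXT)")

form = {"name": "Robert'); DROP TABLE orders; --", "quantity": "3", "ship_on": "2024-05-01"}

# The statement text is fixed; the values travel separately as parameters.
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", parse_request(form))
row = conn.execute("SELECT name, quantity, ship_on FROM orders").fetchone()
```

A non-numeric quantity or a malformed date raises `ValueError` before the query ever runs, and the odd-looking name is stored exactly as typed.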

For protecting against XSS attacks, it would probably be easier to fix the data before it's stored. However, as others mentioned, sometimes it's nice to have a pristine copy of exactly what the user entered, because once you change it, it's lost forever. It's almost too bad there's not a foolproof way to ensure your application only puts out sanitized HTML, the way you can ensure you don't get caught by SQL injection by using prepared queries.

Solution 5:

The most important thing is to always be consistent in when you escape. Accidental double sanitizing is lame and not sanitizing is dangerous.
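
What accidental double escaping looks like, using Python's `html.escape` as the example escaper:

```python
import html

original = "Tom & Jerry"

escaped_once = html.escape(original)        # correct for HTML output
escaped_twice = html.escape(escaped_once)   # escaped again by mistake

# A browser decodes one level only, so the double-escaped text is shown
# to the user with a literal "&amp;" in it.
shown_to_user = html.unescape(escaped_twice)
```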

For SQL, just make sure your database access library supports bind variables, which automatically escape values. Anyone who manually concatenates user input onto SQL strings should know better.

For HTML, I prefer to escape at the last possible moment. If you keep the user's original input, they can edit it and fix a mistake later. If you destroy it, it's gone forever.
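
A rough Python sketch of escaping at the last moment. The `comments` dict stands in for a database table, and the function names are invented for the example.

```python
import html

comments = {}  # stands in for a database table


def save_comment(comment_id: int, text: str) -> None:
    comments[comment_id] = text  # stored exactly as typed


def render_comment(comment_id: int) -> str:
    # Escaping happens here, at output time, and nowhere else.
    return "<p>" + html.escape(comments[comment_id]) + "</p>"


save_comment(1, "I <3 cats & dogz")
first_view = render_comment(1)

# The user fixes a typo later; the original text is still there to edit.
save_comment(1, comments[1].replace("dogz", "dogs"))
second_view = render_comment(1)
```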