Parseable NGINX access log files with delimiters

The default NGINX format is this:

log_format combined '$remote_addr - $remote_user [$time_local] '
                '"$request" $status $body_bytes_sent '
                '"$http_referer" "$http_user_agent"';

This is a bit hard to parse. I am afraid that people inject " in either requests, referrers or user-agents.

I have thought about using my own format instead, one that uses |P-,| as a delimiter:

log_format parsable '$status |P-,| $time_iso8601 |P-,| $http_host '
                    '|P-,| $bytes_sent |P-,| $http_user_agent |P-,| $http_referer '
                    '|P-,| $request_time |P-,| $request';

However, nothing prevents users from injecting |P-,| into their requests, referrers or user-agents.

I read this article about ASCII delimited text: https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/

I think that could be used to solve this problem, but users would be able to inject ASCII delimiters into their data as well.

Is there a best-practice way to solve this problem?


Solution 1:

There is no problem.

I am afraid that people inject " in either requests, referrers or user-agents.

" is represented as \x22

Request:

$ curl 'localhost/"?"="' --header 'User-Agent: "'

Resulting line in the log:

[27/Mar/2014:16:14:42 +0400] localhost 127.0.0.1 "GET /\x22?\x22=\x22 HTTP/1.1" 200 "-" "\x22" "-" "/index.html"

UPDATE

From the nginx changelog:

Changes with nginx 1.1.6 17 Oct 2011

*) Change: now the 0x7F-0xFF characters are escaped as \xXX in an
   access_log.

Changes with nginx 0.7.0 19 May 2008

*) Change: now the 0x00-0x1F, '"' and '\' characters are escaped as \xXX
   in an access_log.
   Thanks to Maxim Dounin.
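
Because '"' and '\' are both escaped, a quoted field in the log can never contain a bare quote, so a plain regular expression is enough to split the default combined format. Below is a minimal Python sketch of that idea, not a definitive parser. The names COMBINED, unescape and parse_combined are made up for illustration, and the sample line in the usage note is constructed from the curl request above rather than copied from a real log.

```python
import re

# Pattern for nginx's default "combined" format. Since nginx writes a
# literal '"' as \x22, [^"]* is safe for the quoted fields.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

# Matches one \xXX escape in the raw bytes of a field.
ESCAPE = re.compile(rb'\\x([0-9A-Fa-f]{2})')


def unescape(value):
    """Turn nginx \\xXX escapes back into the bytes they stand for."""
    raw = ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]),
                     value.encode("utf-8"))
    # Escaped high bytes may form UTF-8 sequences; replace anything invalid.
    return raw.decode("utf-8", errors="replace")


def parse_combined(line):
    """Return a dict of unescaped fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    return {name: unescape(value) for name, value in m.groupdict().items()}
```

For example, parse_combined(r'127.0.0.1 - - [27/Mar/2014:16:14:42 +0400] "GET /\x22?\x22=\x22 HTTP/1.1" 200 612 "-" "\x22"') yields a request of GET /"?"=" HTTP/1.1 and a user agent of ". Unescaping is done in a single pass, so a literal backslash logged as \x5C cannot be mistaken for the start of another escape.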

Solution 2:

Remember that a number of the fields are generated by the system, so they are safe. Put those fields on the left and the hackable ones on the right: request first, then http_referer, with http_user_agent at the end. That way most of the data stays sound. If you then have the parser allow for more delimiters than can possibly exist without insertion (an optional one on the far right), it will detect records that have been subject to insertion.
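
The check described above could look roughly like this in Python. It is only a sketch under assumptions: the field order (TRUSTED followed by UNTRUSTED) is a hypothetical rearrangement of the format from the question, and it simplifies the "optional extra delimiter" idea to counting fields. All names here are invented for the example.

```python
DELIM = " |P-,| "

# Hypothetical field order: system-generated fields first, user-controlled last.
TRUSTED = ["status", "time_iso8601", "bytes_sent", "request_time"]
UNTRUSTED = ["request", "http_referer", "http_user_agent"]


def parse_record(line):
    """Parse one delimited record and flag it if extra delimiters were injected."""
    line = line.rstrip("\n")
    # Split off only the trusted prefix; the remainder stays in one piece.
    head = line.split(DELIM, len(TRUSTED))
    if len(head) <= len(TRUSTED):
        return None  # too few fields to be a valid record

    record = dict(zip(TRUSTED, head[:len(TRUSTED)]))
    parts = line.split(DELIM)
    record["suspect"] = len(parts) != len(TRUSTED) + len(UNTRUSTED)

    if record["suspect"]:
        # The field boundaries in the tail cannot be trusted; keep it raw.
        record["raw_tail"] = head[-1]
    else:
        record.update(zip(UNTRUSTED, parts[len(TRUSTED):]))
    return record
```

A record whose user agent contains |P-,| comes back with suspect set to True, while its status, timestamp, byte count and request time are still read from the correct positions.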

I also recommend using a tab character as the delimiter, as I believe that were someone to attempt to insert it into a URL, it would end up being escaped to %09.