General tips for interpreting error logs [closed]

Reading log files can be pretty frustrating because, by nature, their content says as much about the developer who wrote them as about the problem itself.

Do you have any general-purpose tips for interpreting error logs (e.g. "Google is your friend", "some error codes occur more often than others", or "remember that warnings and errors are very different")?


Solution 1:

Let developers troubleshoot production issues once in a while. This will do wonders for your logging. :)

Solution 2:

This is about a specific but common situation where all of the following hold at once: (1) the problem occurs in a distributed environment, (2) a huge pile of debug info is scattered across cooperating servers and different log files, (3) there is no documentation for interpreting the logs, (4) nothing turns up on Google, (5) you have no clue, and (6) instead of real vendor support you get ping-pong players who bounce the issue back and forth.

  • First of all, make sure the clocks are synchronized across the entire environment (e.g. with NTP). If they are not, forget about working out inter-host relationships from the log files.
  • Do not pick a random "error" from a random log and blame it. Read the log chronologically, and remember that an "error" line may just as well be the result of normal software operation and may always have been there.
  • Compare logs from normal operation with logs from the problem situation. At what point do they stop matching? (vimdiff may be useful.)
  • If you can insert your own custom log messages during test cases, do so (for example, with logger for syslog).
  • During analysis, if you catch yourself switching back and forth between many huge logs to follow the flow of events, try merging them. (Use sed to put the timestamp in the first column, cat and sort to merge multiple files, and of course grep -viE to filter out unnecessary lines.)
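
If the sed/cat/sort/grep pipeline gets unwieldy, the merge step from the last tip can also be sketched in Python. This is only an illustration under stated assumptions: every relevant line starts with a "YYYY-MM-DD HH:MM:SS" timestamp, each log is already in chronological order, and the host clocks are synchronized. The names merge_logs and parse_lines and the sample log lines are made up for the example.

```python
import heapq
import re
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH:MM:SS message". Adjust for your own logs.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LENGTH = 19


def parse_lines(source, lines):
    """Yield (timestamp, source, text) for each line with a leading timestamp."""
    for line in lines:
        line = line.rstrip("\n")
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # no leading timestamp (e.g. a continuation line): skipped here
        yield stamp, source, line[TS_LENGTH:].strip()


def merge_logs(logs, noise=None):
    """Merge already-chronological logs into a single timeline.

    logs  -- dict mapping a source label to an iterable of lines
    noise -- optional case-insensitive regex; matching lines are dropped
    """
    drop = re.compile(noise, re.IGNORECASE) if noise else None
    streams = [parse_lines(name, lines) for name, lines in logs.items()]
    # heapq.merge assumes each input stream is already sorted by the key.
    for stamp, source, text in heapq.merge(*streams, key=lambda entry: entry[0]):
        if drop and drop.search(text):
            continue
        yield f"{stamp.strftime(TS_FORMAT)} [{source}] {text}"


# Hypothetical sample data standing in for two servers' log files.
app = ["2024-01-01 10:00:00 request received",
       "2024-01-01 10:00:05 ERROR timeout"]
db = ["2024-01-01 10:00:02 query started",
      "2024-01-01 10:00:03 DEBUG heartbeat"]

for entry in merge_logs({"app": app, "db": db}, noise=r"debug|heartbeat"):
    print(entry)
```

Passing open file objects instead of lists works the same way, since files iterate line by line. Tagging each line with its source keeps the inter-host flow readable once everything is in one stream.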