Logstash parsing xml document containing multiple log entries

Solution 1:

Alright, I found a solution that does work for me. The biggest problem with the solution is that the XML plugin is ... not quite unstable, but either poorly documented and buggy or poorly and incorrectly documented.

TLDR

Bash command line:

gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-csv.conf

Logstash config:

input {
    stdin {}
}

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
    # multiline filter adds the tag "multiline" only to lines spanning multiple lines
    # We _only_ want those here.
    if "multiline" in [tags] {
        # Add the encoding line here. Could in theory extract this from the
        # first line with a clever filter. Not worth the effort at the moment.
        mutate {
            replace => ["message",'<?xml version="1.0" encoding="UTF-8" ?>%{message}']
        }
        # This filter exports the hierarchy into the field "entry". This will
        # create a very deep structure that elasticsearch does not really like.
        # Which is why I used add_field to flatten it.
        xml {
            target => entry
            source => message
            add_field => {
                fieldx         => "%{[entry][fieldx]}"
                fieldy         => "%{[entry][fieldy]}"
                fieldz         => "%{[entry][fieldz]}"
                # With deeper nested fields, the xml converter actually creates
                # an array containing hashes, which is why you need the [0]
                # -- took me ages to find out.
                fielda         => "%{[entry][fieldarray][0][fielda]}"
                fieldb         => "%{[entry][fieldarray][0][fieldb]}"
                fieldc         => "%{[entry][fieldarray][0][fieldc]}"
            }
        }
        # Remove the intermediate fields before output. "message" contains the
        # original message (XML). You may or may-not want to keep that.
        mutate {
            remove_field => ["message"]
            remove_field => ["entry"]
        }
    }
}

output {
    ...
}

Detailed

My solution works because at least until the entry level, my XML input is very uniform and thus can be handled by some kind of pattern matching.

Since the export is basically one really long line of XML, and the logstash xml plugin essentially works only with fields (read: columns in lines) that contain XML data, I had to change the data into a more useful format.

Shell: Preparing the file

  • gzcat -d file.xml.gz |: Was just too much data -- obviously you can skip that
  • tr -d "\n\r" |: Remove line-breaks inside XML elements: Some of the elements can contain line breaks as character data. The next step requires that these are removed, or encoded in some way. Even though it assumed that at this point you have all XML code in one massive line, it does not matter if this command removes any white space between elements

  • xmllint --format - |: Format the XML with xmllint (comes with libxml)

    Here the single huge spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) Is properly formatted:

    <root>
      <entry>
        <fieldx>...</fieldx>
        <fieldy>...</fieldy>
        <fieldz>...</fieldz>
        <fieldarray>
          <fielda>...</fielda>
          <fieldb>...</fieldb>
          ...
        </fieldarray>
      </entry>
      <entry>
        ...
      </entry>
      ...
    </root>
    

Logstash

logstash -f logstash-csv.conf

(See full content of the .conf file in the TL;DR section.)

Here, the multiline filter does the trick. It can merge multiple lines into a single log message. And this is why the formatting with xmllint was necessary:

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => previous
    }
}

This basically says that every line with indentation that is more than two spaces (or is </entry> / xmllint does indentation with two spaces by default) belongs to a previous line. This also means character data must not contain newlines (stripped with tr in shell) and that the xml must be normalised (xmllint)

Solution 2:

I had a similar case. To parse this xml:

<ROOT number="34">
  <EVENTLIST>
    <EVENT name="hey"/>
    <EVENT name="you"/>
  </EVENTLIST>
</ROOT>

I use this configuration to logstash:

input {
  file {
    path => "/path/events.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "<ROOT"
      negate => "true"
      what => "previous"
      auto_flush_interval => 1
    }
  }
}
filter {
  xml {
    source => "message"
    target => "xml_content"
  }
  split {
    field => "xml_content[EVENTLIST]"
  }
  split {
    field => "xml_content[EVENTLIST][EVENT]"
  }
  mutate {
    add_field => { "number" => "%{xml_content[number]}" }
    add_field => { "name" => "%{xml_content[EVENTLIST][EVENT][name]}" }
    remove_field => ['xml_content', 'message', 'path']
  }
}
output {
  stdout {
    codec => rubydebug
  }
}

I hope this can help someone. I've needed a long time to get it.