LOGSTASH SPLIT FILTER

I wanted to spend some time talking about the split filter, because event splitting is something that should be handled with care.

Consider the following object structure, which is very common in logging on cloud platforms:

{
    "records":[
        {
            "timestamp":1337,
            "eventType": "login",
            "guid":"SomeGloballyUniqueId"
        },
        {
            "timestamp":1338,
            "eventType": "logout",
            "guid":"SomeGloballyUniqueId2"
        }
    ]
}

The root key of the object is records, and each item in the records array represents one document that we would want on the Elasticsearch side. In the best case we don’t need to be too concerned about this: on common Beats inputs like the AWS S3 input, we can name that field in the expand_event_list_from_field setting and the input will split the events into the desired state:

filebeat.inputs:
- type: aws-s3
  queue_url: https://sqs.ap-southeast-1.amazonaws.com/1234/test-s3-queue
  credential_profile_name: elastic-beats
  expand_event_list_from_field: records
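
To make the target concrete: for the sample object above, the end state we want on the Elasticsearch side is one document per element of records, roughly

{
    "timestamp":1337,
    "eventType": "login",
    "guid":"SomeGloballyUniqueId"
}

{
    "timestamp":1338,
    "eventType": "logout",
    "guid":"SomeGloballyUniqueId2"
}

rather than a single document carrying the whole array (the real output will also carry the usual Beats metadata fields).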

Outside of an input module, though, there is no built-in processor in Beats (as of 8.7) that can accomplish the same thing, so for any event source that needs this sort of splitting you’re really left with two options.

  1. Transform the event pre-ingestion (i.e. before it touches any Elastic components)
  2. Transform the event in Logstash (which is the only component that supports this one-to-many split behavior)

Option 1 should always be pursued where feasible, but in my experience the following issues eventually need to be addressed.

  1. Logs generated by a script that lives on a server: the SME who wrote the script considered it a done deal and wants to forget about it, so now you get pushback from the SME.
  2. The technical know-how may not exist, or segregation of duties is a blocker: if the logs live in cloud infrastructure, it should be straightforward to create a process (I’ve used Lambdas and Glue to process logs) that transforms the logs pre-Elastic. However, (1) someone needs to create and own that process and keep it healthy, (2) segregation of duties might make this much more of a headache than it needs to be, and (3) if you’re processing logs and storing them to a ‘processed’ S3 bucket, you’re doubling your storage costs for the log data.

Option 2 is very straightforward for anyone who knows Logstash well:

filter {
  split {
    field  => "records"
    target => "_wip_split"
  }
  # remove the original records array on a successful split
  if [_wip_split] {
    mutate { remove_field => ["records"] }
  }
}
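
One thing to keep in mind with target => "_wip_split": the split filter places each record under that field in the resulting events, so timestamp, eventType and guid end up nested rather than at the root. If you want them at the top level, a minimal follow-up sketch (using the field names from the sample records above) could look like this:

filter {
  if [_wip_split] {
    mutate {
      # move the record fields from the working object to the event root
      rename => {
        "[_wip_split][timestamp]" => "timestamp"
        "[_wip_split][eventType]" => "eventType"
        "[_wip_split][guid]"      => "guid"
      }
      # remove_field runs after the renames, dropping the now-empty working object
      remove_field => ["_wip_split"]
    }
  }
}

Leaving target unset would instead write each record back into records itself, so you would still need a similar promotion step if you want the fields at the root.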

But you should note that you have now married this log source to Logstash and created a dependency where one might not be feasible or desired.

  1. If you’re in Elastic Agent land, you now have additional certs (agent => Logstash) to maintain, as sketched below for a standalone agent.
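
For a standalone agent, that dependency shows up in the output section of the agent policy; a minimal sketch (hostnames and cert paths here are placeholders) of the TLS material you now have to issue and rotate:

outputs:
  default:
    type: logstash
    hosts: ["logstash.example.internal:5044"]
    ssl.certificate_authorities: ["/etc/agent/certs/ca.crt"]
    ssl.certificate: "/etc/agent/certs/agent.crt"
    ssl.key: "/etc/agent/certs/agent.key"

Fleet-managed agents configure the same thing through the Fleet output settings rather than a local policy file.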