03.16.18

Mute uninteresting log noise with Machine Learning

Most people have a way of building their own machine learning textual classifiers; you can use cloud APIs, learn Tensorflow or roll your own, or use Machine Box. As founder of the latter, it won’t surprise you to learn that in this article I am going to tackle log data using Classificationbox — a Docker container that lets you build your own classifiers by just making JSON/HTTP requests.

The problem

Most applications spit out many lines of log data of varying importance and usefulness.

Wouldn’t it be nice if an AI could draw our attention to important log items?

The data all starts to look the same, and it can be tough to notice when something is amiss, or to find key information that gets lost in the scroll.

Wouldn’t it be nice if a helpful AI could read the logs for us, and only show us the interesting bits?


Classifiers

A classifier is a machine learning model that can predict which category (or class) a piece of input data falls into. For example, you can build a classifier to identify if an image is of a dog or a cat.

You can build a classifier from any of kind of data, and Classificationboxsupports numbers, text blobs, keywords, lists and images — or any combination.

Classificationbox lets you build classifiers from many different types of data inputs

Classifying log items

If we can build a classifier that knows the difference between noise and not-noise in lines of log text, then we could apply this knowledge automatically to filter out or turn down the noise. If it gets really good, it could even form the basis of a notification system, but let’s not get ahead of ourselves.

Later, I’ll show you a simple Go program that just pipes lines of log streams through it, dimming the noise and highlighting the useful stuff.

To start with, we need some training data. Getting noise is easy enough, but for classifiers to work best they need a balanced number of examples — that is, a similar number of examples for each class in the training data.

Assuming we have the data, we then teach the classifier each example with some code like this: (this is Go code, btw)

cb.Teach(ctx, "model1", classificationbox.Example{
    Class: class,
    Inputs: []classificationbox.Feature{
        classificationbox.FeatureText("line", line),
    },
})

The line and class variables would contain the example log line, and whether we consider it to be noise or not, respectively.

Best practice is to use a random 80% of your training data to teach the model, and the remaining 20% to validate it. To see this in action, read Build a machine learning image classifier from photos on your hard drive very quickly.

Making predictions

Once the training is complete and our model is ready, we can then start to ask it to predict which class each line of log text is most likely to fall into.

With a simple HTTP request, this is done like this:

POST /classificationbox/models/1/predict
{
    "limit": 1
    "inputs": [{
        "key": "line",
        "type": "text",
        "value": "Log line goes here"
    }]
}

You can of course also use the Classificationbox Go SDK here too if you like.

Here we are asking the model to pick one class (via the limit field) based on the log text line. The key and type should match the examples, but obviously the value can be different.

The response will look something like this:

{
    "success": true,
    "classes": [{
            "id": "noise",
            "score": 0.97
    }]
}

We can see that the model predicts that the log entry is noise. For non-noise, it will return the other class.

Because Classificationbox can make predictions very quickly, you can incorporate this capability into a variety of different places in your stack.

Once interesting approach is to build a little bash tool that you can pipe log streams through and ask it to hide or dim the noise, highlighting the bits we care about.

A tool to pipe log items through

When exploring this idea, I wrote this Go program that you can use in the terminal to pipe through log files, or even the output of running programs, and have the output colour change depending on whether that item is considered interesting or not.

You could run it in a variety of ways using unix pipes to stream the data:

cat logfile.txt | logclass

or something like:

sometool | logclass | notifytool

Here’s the code

https://blog.machinebox.io/mute-uninteresting-log-noise-with-machine-learning-daa19f6e222