03.30.18

Build your own fake news detector using machine learning

It is a sign of the times that in 2018, the UK Government established a new unit to tackle fake news, and every day seems to reveal more about the dirty tricks played by companies like Cambridge Analytica, including deliberately spreading misinformation to try to influence electorates in favour of whoever happens to be paying them.

This is a problem because it hands more power to those with money, people and groups who already have plenty, and takes it away from ordinary people, whom democracy is supposed to serve. Whether or not our own political preferences happen to be served by recent fake news, everybody should be concerned about this.

It is also becoming increasingly difficult to doubt that certain foreign states are interfering in the elections and referendums of other countries in a significant way, by spreading fake news to those most willing to receive it.

Recent advances in the availability of artificial intelligence present us with an opportunity to tackle this problem at scale.

Machine learning classifiers

Classifiers are machine learning models that take a set of examples and learn which class each example falls into. You can then ask the model to classify content it has never seen before, often with impressive accuracy.

For example, let’s say we want to teach a model to classify neutral vs biased article titles. We would gather as many examples of each class as we could:

class    example
neutral  China's space lab set for fiery re-entry
biased   10 Reasons Progressive Liberals Should Vote for Trump
neutral  Thousands of violent crime suspects released
biased   Right wing people are stupid
neutral  New £30m fund to help rough sleepers

Then we would use a technology like Classificationbox to train a model based on these inputs.

How the model learns is complicated, and extremely difficult to understand if you poke around at the internal data structures it uses, but we can verify that it works: send in some of our examples without providing the class, and ask the model to make a prediction.

The number of these test predictions it gets right is how we measure accuracy.
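For example, if the model correctly classifies 173 of 180 held-back examples, its accuracy is 173 ÷ 180, or roughly 96%.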

Who decides what’s neutral or biased?

It is unlikely we will all agree on the classes in the examples above, so how do we decide what’s neutral and what’s biased?

The model is only as good as the training data, and it will indeed be biased based on who gathered that data. It would of course be possible to train a model that thinks heavily biased statements are neutral, and vice versa.

In a real-world situation, the answer is probably to keep the training data open, so that everybody can contribute to it and help maintain it. Otherwise you have to trust the people who trained the models.

Machine Box provides Fakebox, a fake news classifier trained on substantial datasets built from a common-sense classification of news articles. Articles that could fairly be considered only slightly biased were excluded, so the model does a good job in most people’s eyes.

In this article, we will see how to build our own Fakebox alternative with Classificationbox, using our own training data.

Prepare training data

The simplest way to prepare training data is to create a folder. If you’re doing this as a team, a shared folder (like Dropbox, Google Drive, or some internal network location) might work best, so that everybody can contribute.

Create a subfolder for each class, for example:

/training-data
    /biased
    /neutral
    /satire
    /junksci

Then put each example into a text file inside the appropriate folder, like this:

/training-data
    /biased
        biased-example1.txt
        biased-example2.txt
        biased-example3.txt

Balance the classes

One key principle is that each class should have more or less the same number of examples. If you have more examples of junk science than anything else, the model will likely be biased towards that class.
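Before training, it is worth sanity-checking the balance. Here is a minimal Go sketch that counts the examples in each class folder, assuming the folder layout above (the /training-data path is just the example from earlier):

package main

import (
    "fmt"
    "io/ioutil"
    "path/filepath"
)

func main() {
    root := "/training-data" // example path from above; point this at your own folder
    classes, err := ioutil.ReadDir(root)
    if err != nil {
        panic(err)
    }
    for _, class := range classes {
        if !class.IsDir() {
            continue
        }
        // count the example files inside this class folder
        files, err := ioutil.ReadDir(filepath.Join(root, class.Name()))
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s: %d item(s)\n", class.Name(), len(files))
    }
}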

How much data?

The best number of examples differs widely depending on a number of factors, but you should start with at least 100 examples in each class, and go from there.

The teaching approach

The best way to teach the classifier is to take 80% of your example data and use it for teaching, holding back the remaining 20% for validation.
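If you want to do that split yourself rather than leave it to a tool, one simple approach is to shuffle the example files and slice off the first 80%. A rough Go sketch, with placeholder file names:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

func main() {
    // paths of your example files for one class (placeholders)
    files := []string{"a.txt", "b.txt", "c.txt", "d.txt", "e.txt"}
    rand.Seed(time.Now().UnixNano())
    rand.Shuffle(len(files), func(i, j int) {
        files[i], files[j] = files[j], files[i]
    })
    split := len(files) * 80 / 100
    teach, validate := files[:split], files[split:]
    fmt.Println("teach:", teach)
    fmt.Println("validate:", validate)
}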

In Classificationbox, the teaching itself is a simple HTTP POST request:

POST /classificationbox/models/1/teach
{
  "class": "biased",
  "inputs": [
    {"key": "content", "type": "text", "value": "...content..."}
  ]
}

The slightly verbose inputs array and key/type/value objects exist because you can actually teach classifiers with a range of different data types, including numbers and even images.
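To make this concrete, here is roughly how you might send the teach request from Go with only the standard library; the localhost:8080 address and model ID 1 are assumptions that match the examples in this article:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // build the same JSON body as the teach request above
    body, err := json.Marshal(map[string]interface{}{
        "class": "biased",
        "inputs": []map[string]string{
            {"key": "content", "type": "text", "value": "...content..."},
        },
    })
    if err != nil {
        panic(err)
    }
    resp, err := http.Post(
        "http://localhost:8080/classificationbox/models/1/teach",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("teach:", resp.Status)
}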

Then take the remaining 20% of the data, and ask Classificationbox to guess which class each example belongs in. You do this with a predict request:

POST /classificationbox/models/1/predict
{
  "inputs": [
    {"key": "content", "type": "text", "value": "...content..."}
  ]
}

Notice that we do not provide the class in this request.
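And the matching predict call from Go. The response decoding below assumes Classificationbox replies with a list of classes and confidence scores; treat the exact field names as an assumption and check the API reference:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    body, err := json.Marshal(map[string]interface{}{
        "inputs": []map[string]string{
            {"key": "content", "type": "text", "value": "...content..."},
        },
    })
    if err != nil {
        panic(err)
    }
    resp, err := http.Post(
        "http://localhost:8080/classificationbox/models/1/predict",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    // assumed response shape: {"classes":[{"id":"biased","score":0.92}, ...]}
    var result struct {
        Classes []struct {
            ID    string  `json:"id"`
            Score float64 `json:"score"`
        } `json:"classes"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        panic(err)
    }
    for _, c := range result.Classes {
        fmt.Printf("%s: %.2f\n", c.ID, c.Score)
    }
}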

Classificationbox also supports multiple languages, which you can specify by using the language code in the type, for example text_sp for Spanish.

Meet the textclass tool

The textclass tool is a simple Go program that walks a folder structure like the one described above, and does the teaching and validation of Classificationbox for you.

You can install it (assuming you have Go installed) by popping this into a terminal:

go get github.com/machinebox/toys/textclass

Run Classificationbox

You can run Classificationbox for free by going to your terminal, and doing:

$ docker run -p 8080:8080 -e "MB_KEY=$MB_KEY" machinebox/classificationbox

  • You’ll need to have Docker installed
  • If you don’t have an MB_KEY, you can get one for free from the Machine Box website

Train and validate

In a terminal, run the textclass tool:

textclass -src /path/to/training-data

You will be prompted a few times to hit Y to confirm what the tool will do:

Classes
-------
fake: 300 item(s)
real: 300 item(s)
satire: 300 item(s)
Create new model with 3 classes? (y/n): y
new model created: 5abe0d3302484439
Teach and validate Classificationbox with 720 (80%) random items? (y/n): y

After some time, you will be presented with the results:

Validation complete
Correct:    173
Incorrect:  7
Errors:     0
Accuracy:   96%

So we now have a classifier that can make predictions with 96% accuracy on the held-back validation data.

Put into production?

If you want to put your model into production, you can — since Classificationbox is just a Docker container, you can save the state file, and use it when spinning up new instances in your own environment, or in the cloud.

If you share the same state file with multiple instances of Classificationbox, you can load balance the traffic to achieve planet scale.
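As a sketch, grabbing the state file (using the model ID from the sample output above) might look something like the following; the state endpoint shown is an assumption based on how Machine Box boxes generally expose model state, so verify it against the Classificationbox docs:

$ curl -o model.classificationbox \
      http://localhost:8080/classificationbox/state/5abe0d3302484439

Each new instance can then load that same state file at startup, so every replica serves identical predictions.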

Need help?

We hang out all day in the Machine Box Community Slack, and you’re invited to join us to ask questions, or tell us about what you’ve built.