Build your own fake news detector using machine learning
It is a sign of the times that in 2018 the UK Government established a new unit to tackle fake news. Every day seems to reveal more about the dirty tricks played by companies like Cambridge Analytica, including deliberately spreading misinformation to try to influence electorates in favour of whoever happens to be paying them.
This is a problem because it hands more power to those with money, people and groups who already have plenty, and takes it away from ordinary people, whom democracy is supposed to serve. Whether or not our own political preferences happen to be served by recent fake news, everybody should be concerned about this.
It is also becoming increasingly difficult to doubt that certain foreign states are interfering in the elections and referendums of other countries in a significant way, by spreading fake news to those who are most willing to receive it.
Recent advances in the availability of artificial intelligence present us with an opportunity to tackle this problem at scale.
Machine learning classifiers
Classifiers are machine learning models that take a set of labelled examples and learn which class each example falls into. You can then ask the model to classify content it has never seen before, often with impressive accuracy.
For example, let’s say we want to teach a model to classify neutral vs biased article titles. We would gather as many examples of each class as we could:
class    example
neutral  China's space lab set for fiery re-entry
biased   10 Reasons Progressive Liberals Should Vote for Trump
neutral  Thousands of violent crime suspects released
biased   Right wing people are stupid
neutral  New £30m fund to help rough sleepers
Then we would use a technology like Classificationbox to train a model based on these inputs.
Exactly how the model learns is complicated, and hard to understand by poking around at its internal data structures. But we know it works, because we can test it: we send in some of our examples without the class, and ask it to make a prediction.
The number of these test predictions it gets right is how we measure accuracy.
Who decides what’s neutral or biased?
It is unlikely we will all agree on the classes in the examples above, so how do we decide what’s neutral and what’s biased?
The model is only as good as the training data, and it will indeed be biased based on who gathered that data. It would of course be possible to train a model that thinks heavily biased statements are neutral, and vice versa.
In a real-world situation, the answer is probably to keep the training data open, so that everybody can contribute to and maintain it. Otherwise you have to trust the people who trained the models.
Machine Box provides Fakebox, a fake news classifier trained with significant datasets based on common sense classification of news articles. Any articles that could fairly be considered only slightly biased were not included, so the model does a good job in most people’s eyes.
In this article, we will see how to build our own Fakebox alternative, using our own training data and Classificationbox.
Prepare training data
The simplest way to prepare training data is to create a folder. If you're doing this as a team, a shared folder (like Dropbox, Google Drive, or an internal network location) might work best, so everybody can contribute.
Create a subfolder for each class, for example:
/training-data
    /biased
    /neutral
    /satire
    /junksci
Then put each example into a text file inside the appropriate folder, like this:
/training-data
    /biased
        biased-example1.txt
        biased-example2.txt
        biased-example3.txt
Balance the classes
One key principle is that each class should have more or less the same number of examples. If you have more examples of junk science than anything else, the model will likely be biased towards that class.
How much data?
The best number of examples differs widely depending on a number of factors, but you should start with at least 100 examples in each class, and go from there.
The teaching approach
The best way to teach the classifier is to take 80% of your example data and use it for teaching.
In Classificationbox, this can be done with a simple HTTP POST request:
POST /classificationbox/models/1/teach
{
  "class": "biased",
  "inputs": [
    {"key": "content", "type": "text", "value": "...content..."}
  ]
}
The slightly verbose inputs array and key/type/value objects exist because you can actually teach classifiers with a range of different data types, including numbers and even images.
Then take the remaining 20% of the data, and ask Classificationbox to guess which class each example belongs in. You do this using a predict request:
POST /classificationbox/models/1/predict
{
  "inputs": [
    {"key": "content", "type": "text", "value": "...content..."}
  ]
}
Notice that we do not provide the class in this request.
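A predict call can be made the same way in Go. The sketch below assumes the response is a list of classes with scores (check the Classificationbox docs for the exact shape for your version) and picks the highest-scoring one:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// predictResponse models the reply we expect from a predict call: a list
// of candidate classes, each with a confidence score.
type predictResponse struct {
	Classes []struct {
		ID    string  `json:"id"`
		Score float64 `json:"score"`
	} `json:"classes"`
}

// bestClass decodes a predict response and returns the highest-scoring class.
func bestClass(raw []byte) (string, float64, error) {
	var pr predictResponse
	if err := json.Unmarshal(raw, &pr); err != nil {
		return "", 0, err
	}
	best, bestScore := "", -1.0
	for _, c := range pr.Classes {
		if c.Score > bestScore {
			best, bestScore = c.ID, c.Score
		}
	}
	return best, bestScore, nil
}

func main() {
	body := []byte(`{"inputs":[{"key":"content","type":"text","value":"10 Reasons Progressive Liberals Should Vote for Trump"}]}`)
	resp, err := http.Post("http://localhost:8080/classificationbox/models/1/predict",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("predict failed:", err) // e.g. box not running
		return
	}
	defer resp.Body.Close()
	var buf bytes.Buffer
	buf.ReadFrom(resp.Body)
	class, score, err := bestClass(buf.Bytes())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("predicted %q with score %.2f\n", class, score)
}
```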
Classificationbox also supports multiple languages, which you can specify by including the language code in the type, for example text_sp for Spanish.
Meet the textclass tool
The textclass tool is a simple Go program that walks a folder structure like the one described above, and performs the teaching of Classificationbox for you.
You can install it (assuming you have Go installed) by popping this into a terminal:
go get github.com/machinebox/toys/textclass
Run Classificationbox
You can run Classificationbox for free by going to your terminal, and doing:
$ docker run -p 8080:8080 -e "MB_KEY=$MB_KEY" machinebox/classificationbox
- You’ll need to have Docker installed
- If you don’t have an MB_KEY, you can get one for free from the Machine Box website
Train and validate
In a terminal, run the textclass tool:
textclass -src /path/to/training-data
You will be prompted a few times to hit Y to confirm what the tool will do:
Classes
-------
fake: 300 item(s)
real: 300 item(s)
satire: 300 item(s)

Create new model with 3 classes? (y/n): y
new model created: 5abe0d3302484439
Teach and validate Classificationbox with 720 (80%) random items? (y/n): y
After some time, you will be presented with the results:
Validation complete
Correct: 173
Incorrect: 7
Errors: 0
Accuracy: 96%
So we now have a classifier that can make predictions, with 96% accuracy.
Put into production?
If you want to put your model into production, you can — since Classificationbox is just a Docker container, you can save the state file, and use it when spinning up new instances in your own environment, or in the cloud.
If you share the same state file with multiple instances of Classificationbox, you can load balance the traffic to achieve planet scale.
Need help?
We hang out all day in the Machine Box Community Slack, and you’re invited to join us to ask questions, or tell us about what you’ve built.