MachineBox OCR
02.22.19

The Evolution of OCR Technology

Summary: 

  • Optical character recognition has been around for decades but has become even more advanced thanks to artificial intelligence. 
  • Traditional optical character recognition software lacked the scalability needed to handle modern media workflows. 
  • Enterprise AI platforms can remove the need to build custom deployments for ML content pipelines.

Optical Character Recognition (OCR), also referred to as text recognition, has been around for decades. The first OCR machine, the retina scanner, was invented by Charles R. Carey in 1870. Having evolved throughout the 20th century and into the 21st, OCR software today converts physical, printed documents into machine-readable text. In other words, rather than scanning a document into an image you can’t edit, you can convert it into text that you can edit and search.

OCR has historically had limitations. However, with the advent of artificial intelligence (AI) and machine learning (ML), text recognition capabilities have become more advanced. We’ll explore why and when you should use OCR and how AI has improved the process. 

How OCR Makes Paperless Operations More Organized 

Many organizations have gone paperless for various reasons, from minimizing their environmental impact to adjusting to a remote workforce. But one of the primary reasons is that paper files are easily lost, and surfacing a specific document takes manual work.

For example, a lawyer who has practiced for 30 years typically has many file cabinets filled to the brim with paper: contracts, agreements, trusts, memos, letters, and even the occasional take-out menu stuffed into a hanging file folder. Even after years of working hard to keep everything organized, finding a specific file when you need it is difficult, given how much paper any organization accumulates over time.

Most organizations scan their documents to have digital copies of everything. But that only protects against the destruction of the physical copy. Even with a digital copy, you can’t search the content inside it, or at least not easily, depending on the file type. That’s where OCR comes in.

When you scan your documents with OCR, instead of creating a photo, you create a document that contains the text on the page in a form the computer can understand. It’s the difference between scanning a receipt into an image and scanning it into a Word doc that you can edit.

However, OCR, at least in the traditional sense, becomes challenging to use when approaching video files. To explain why OCR struggles with video, let’s take a look at the following example. 

https://www.geekwire.com/2017/amazons-first-nfl-live-stream-overcomes-early-glitches-long-weather-delay/

If you run this through one of the many cloud APIs for OCR out there, this is the kind of data you get back:

{
  "description": "10",
  "boundingPoly": {
    "vertices": [
      { "x": 400, "y": 628 },
      { "x": 419, "y": 628 },
      { "x": 419, "y": 662 },
      { "x": 400, "y": 662 }
    ]
  }
}

Ten? What does that number mean? Is it the score or something else? Every number, and in some cases every letter, gets its own entry in the results. At scale this data isn’t very useful, because nothing in an entry says what the text refers to. That changes with the introduction of ML capabilities.
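To make the problem concrete, here is a minimal Python sketch that reads an entry like the one above. It assumes only the two fields visible in the sample (description and boundingPoly); the summarize helper is a name made up for this illustration, not part of any OCR API.

```python
import json

# One entry copied from the sample OCR response in this article.
entry_json = """
{
  "description": "10",
  "boundingPoly": {
    "vertices": [
      {"x": 400, "y": 628},
      {"x": 419, "y": 628},
      {"x": 419, "y": 662},
      {"x": 400, "y": 662}
    ]
  }
}
"""


def summarize(entry):
    """Return the recognized text and its box as (left, top, right, bottom).

    Hypothetical helper for illustration; it relies only on the fields
    shown in the sample entry.
    """
    vertices = entry["boundingPoly"]["vertices"]
    xs = [v["x"] for v in vertices]
    ys = [v["y"] for v in vertices]
    return entry["description"], (min(xs), min(ys), max(xs), max(ys))


text, box = summarize(json.loads(entry_json))
print(text, box)
```

All the entry yields is a string and a rectangle in pixel coordinates. Whether that string is a score, a yard line, or part of a clock has to come from somewhere else.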

You can narrow in on the number you really want using Veritone aiWARE, a hyper-expansive Enterprise AI platform, and its text recognition models such as Objectbox. For instance, let’s assume you want to extract the game clock from the screen. A company managing hundreds of hours of NFL content might need this capability to make its operations more organized and scalable.

After ingesting sample video into the platform, you can draw boxes around objects in the video, and the platform uses those boxes to train the model. By drawing boxes around the game clock only, you can train the model with 4 or 5 examples across 3 or more videos.

Fortunately, the initial detection feedback the tool gives you shows whether you’ve used enough examples. After model training completes, you can use that same model to detect the game clock in all of your NFL videos. You can then automate this process by building a workflow with Veritone Automate Studio, an aiWARE low-code/no-code workflow builder.
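One way to picture how a trained detector helps is to treat its output as a region of interest and keep only the OCR entries that fall inside it. The Python sketch below is an illustration under stated assumptions, not aiWARE’s or Objectbox’s actual API: the clock region, the clock text entries, and every function name are hypothetical. Only the "10" entry comes from the sample response earlier in this article.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def make_entry(text: str, left: int, top: int, right: int, bottom: int) -> Dict:
    """Build an OCR entry in the same shape as the sample response."""
    return {
        "description": text,
        "boundingPoly": {
            "vertices": [
                {"x": left, "y": top},
                {"x": right, "y": top},
                {"x": right, "y": bottom},
                {"x": left, "y": bottom},
            ]
        },
    }


def to_box(entry: Dict) -> Box:
    """Collapse an entry's vertices into an axis-aligned box."""
    vertices = entry["boundingPoly"]["vertices"]
    xs = [v["x"] for v in vertices]
    ys = [v["y"] for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def inside(inner: Box, outer: Box) -> bool:
    """True when the inner box lies entirely within the outer box."""
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


def text_in_region(entries: List[Dict], region: Box) -> str:
    """Join, left to right, the text of every entry inside the region."""
    hits = [(to_box(e), e["description"]) for e in entries if inside(to_box(e), region)]
    hits.sort(key=lambda hit: hit[0][0])  # order by left edge
    return "".join(text for _, text in hits)


# Hypothetical box that a trained detector might report for the game clock.
clock_region: Box = (600, 620, 700, 670)

ocr_entries = [
    make_entry("10", 400, 628, 419, 662),  # from the sample response
    make_entry("45", 640, 630, 660, 660),  # hypothetical clock digits
    make_entry("12", 610, 630, 630, 660),  # hypothetical clock digits
    make_entry(":", 632, 630, 636, 660),   # hypothetical separator
]

clock_text = text_in_region(ocr_entries, clock_region)
print(clock_text)
```

With these made-up inputs, the ambiguous "10" is dropped because it sits outside the clock region, and the remaining fragments are reassembled into a single clock reading.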

With aiWARE, you can solve these challenges in production without having to build your own deployment schemes around ML pipelines. Explore Veritone aiWARE today, along with its ecosystem of over 300 AI models, including the latest in OCR.


Updated 9/13/2022
Was originally published in 2019 on Medium by Machinebox, a Veritone Company.