Making Sense of Other “Documents”: Veritone AI for Video and Images
08.13.21

Originally published in Document Imaging Report

The pandemic has placed a spotlight on the need for businesses of all types to pay attention to their unstructured content. As volume continues to skyrocket, companies are going to need to devise ways to understand what’s in all of those documents, videos, images, and phone/video calls, for governance and risk reasons, but also to create value.

One technology that’s going to play a crucial role is artificial intelligence (AI). We spoke with Ryan Bazler, VP of Marketing with Veritone, a software company that provides AI tools (among a wide variety of other technology) to help understand customer-facing content like phone/video calls, videos and images emailed or posted to social sites, customer texts, etc.

Veritone and aiWARE

Here’s Veritone’s description of aiWARE:

“The world’s first operating system for artificial intelligence, Veritone aiWARE, orchestrates a diverse ecosystem of ready-to-deploy machine learning models to transform audio, video, text, and other data sources into actionable intelligence, at scale, with no AI expertise. With aiWARE, leverage digital workers to save manual review time, gain valuable data insights, and cognitively enrich end-to-end workflows.”

We began our conversation with a question everyone in this industry is familiar with – are we finally seeing a technology that companies will adopt to uncover the value in the 80% of their data that they currently mostly ignore?

Bazler noted that much of this data is unusable even though it has value. Valuable insights could be locked inside phone conversations (call center calls, for example), video, and communications over email, text, or social. “AI is well suited to operate on that content and extract useful information. The problem is, up till now, AI models have been disappointing because they typically have to be developed from scratch, which requires data scientists and ML engineers to build those models, and most fail in high-volume production settings,” says Bazler.

That expertise isn’t cheap. Plus, the models need to be adjusted over time based on data seen in production. Bazler suggests this is why AI is having a hard time taking off.

Veritone’s answer has been to focus on creating an “AI operating system” they call aiWARE and a suite of ready-to-deploy models. Depending on the use case, some models are trainable and others can be used without upfront training. Another benefit of this approach is that “you can leverage models from multiple vendors. You can try one from Microsoft, Google, Amazon, Veritone, etc. and compare them to see which works best for you.”

What Veritone has built with aiWARE is essentially an ecosystem of AI models that any application or process can invoke. Says Bazler, “Veritone has what we call a VTN standard for each cognitive category of engines. It’s a universal way of inputting source data into the model and getting the output from those models regardless of the underlying vendor model being used, so switching between AI models is a snap and takes seconds to do.”
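To make the idea of a uniform input/output contract concrete, here is a minimal Python sketch. It is not Veritone's actual VTN standard or API: the `Word` record, the `TranscriptionEngine` interface, and the two stand-in "vendor" adapters are all hypothetical, invented only to show how a shared output shape lets the caller swap models without changing its own code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Word:
    """Hypothetical common output record: one word with millisecond timestamps."""
    text: str
    start_ms: int
    end_ms: int


class TranscriptionEngine(Protocol):
    """Hypothetical common interface that every adapter is expected to satisfy."""
    name: str

    def transcribe(self, audio: bytes) -> list[Word]: ...


class VendorAEngine:
    """Stand-in adapter: pretends vendor A reports (text, start, end) in seconds."""
    name = "vendor-a"

    def transcribe(self, audio: bytes) -> list[Word]:
        raw = [("hello", 0.0, 0.4), ("world", 0.5, 0.9)]  # canned response
        return [Word(t, round(s * 1000), round(e * 1000)) for t, s, e in raw]


class VendorBEngine:
    """Stand-in adapter: pretends vendor B reports dicts with millisecond fields."""
    name = "vendor-b"

    def transcribe(self, audio: bytes) -> list[Word]:
        raw = [
            {"w": "hello", "from": 0, "to": 400},
            {"w": "world", "from": 500, "to": 900},
        ]  # canned response
        return [Word(r["w"], r["from"], r["to"]) for r in raw]


# Registry keyed by engine name; switching models is a one-string change.
ENGINES = {engine.name: engine for engine in (VendorAEngine(), VendorBEngine())}


def run(engine_name: str, audio: bytes) -> list[Word]:
    """Run the chosen engine and return output in the shared shape."""
    return ENGINES[engine_name].transcribe(audio)


if __name__ == "__main__":
    for name in ENGINES:
        print(name, run(name, b""))
```

Because each adapter converts its vendor-specific response into the same `Word` records, the code that consumes the transcript never needs to know which vendor produced it.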

I asked if there’s a theoretical limit to the AI models that could be included in this “AI operating system.” Other than the time to do the API integrations, there isn’t. “In fact,” Bazler claims, “aiWARE provides a suite of ready-to-deploy models across all the cognitive categories – computer vision, speech recognition, text analytics, and data analysis – while also allowing people to onboard models they’ve developed for production deployment.”

With clients like CNN, Bloomberg, and ESPN, Veritone is currently processing four years of audio and video content every day on its platform. Many times they are looking for a needle in a haystack, such as searching the archival footage of the San Francisco Giants to find the face of a baseball player during the time he played for a certain team or a home run on a particular day. The Giants are also using Veritone to transcribe and close-caption select footage for social media posts.

AI in the Cloud

aiWARE is a born-cloud platform that runs in AWS and Azure. Since I live close to DC, this struck me as a tool that all the three-letter acronym intelligence agencies would be interested in. Without offering specifics, Bazler did say that the platform is FedRAMP certified for Federal work and Veritone has a large government business running on both AWS and Azure GovClouds.

This ability to take AI and scale it is a key competitive differentiator for Veritone, Bazler says. The platform can be run on a private cloud, isolated from other networks. Soon Veritone will come out with a purely network-isolated solution.

There are drawbacks to an on-premises rollout, in that you have to plan for your own scalability, and upgrades aren’t automatic. But some applications handle PII and other sensitive information that cannot leave the network, and those require it.

Customers and Uses

As Bazler says, there’s a wide-open field for use cases: “any use case that has audio, video, or images is really our sweet spot, plus text extraction for legal discovery. We work with many partners in the legal space to provide the analysis of any media source, to produce a search key of names, places, dates, locations, etc.”

The company is divided into business units: GLC (government, legal, and compliance), media and entertainment, and radio advertising, plus the horizontal aiWARE business unit, which also focuses on AI for energy use cases. The company has built vertical, business-user-friendly applications on top of aiWARE and sells these to many of the major media channels and networks for video retrieval, as previously mentioned. The radio advertising unit offers applications that help advertisers determine the effectiveness of their ad spend.

One interesting customer is the Anaheim Police Department, one of several police jurisdictions Veritone counts as customers. The application can “automatically find the faces of suspects and redact them for privacy” in video. Normally, this is a manual, frame-by-frame process. The software could also take the video feed of a carjacking at a gas station and then run the images against suspect databases that exist across the country.

This raises a variety of privacy issues beyond the scope of this newsletter, but it does conjure up images of the Tom Cruise movie, Minority Report.

Bazler, who has roots in the capture industry, notes the parallels with document capture use cases, which apply the same concept of extracting value from content without the need for human review. Some customers had full-time employees whose job was to review video or images for faces or objects. Insurance is another good example. Adjusters need to review accident scenes, looking for damage, license plates, etc. Those elements can now be extracted automatically.

If some of this sounds similar to the promise of search technology in the late 90s/early 2000s, you’re not wrong. I joked with Bazler that a blog post on the Veritone site about the benefits of AI in ediscovery could’ve been published 15 years ago with “search” replacing “AI.” Of course, technology does improve. Bazler notes that AI can scale in the cloud in a way that many of the tools in the late 90s couldn’t, opening up new value streams for video insight, for example.

A final use case Bazler mentioned is transcription and translation. Bazler says, “Think of a contact center or customer service calls. Global organizations obviously need to transcribe and translate multiple languages. This allows companies to then do a whole bunch of NLP and NLU on that content, searching for trouble spots and weaknesses in customer service. One reason to transcribe is that it’s faster to search the text than the original audio or video content. This gets companies closer to being able to act in real-time rather than using AI to respond to historical data only. Things like being able to upsell a customer based on what they’re saying when they are speaking to a customer service or sales rep.”

We briefly discussed accuracy. As with all previous capture conversations I’ve had, the answer was “it depends.” The upshot: it works well, but ongoing model evaluation based on real-world data sets is always going to be needed to meet business goals. Some of the engines in the aiWARE operating system are trainable, others aren’t.

One of the more interesting points in the conversation involved the output. You can integrate engine output with virtually any system.

Bazler sees aiWARE as meeting a critical need to infuse AI into existing RPA, BPA, ECM, BI, and legacy applications. These applications struggle with processing unstructured media sources – this is where aiWARE fills the gap. Customers can integrate aiWARE at the API level or use Veritone’s Automate Studio workflow tool for a low-code integration approach. Either way, Veritone will typically send JSON output to the calling system, consisting of data such as time-stamped words in a transcript, entities or sentiment scores in text, names of identified faces in video, speaker-separated timestamps in audio, etc. Basically structuring the unstructured.
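As a rough illustration of what a calling system might do with that kind of output, here is a short Python sketch that parses a JSON payload of time-stamped words and looks up where a keyword was spoken. The payload layout and field names (`engine`, `words`, `text`, `start_ms`, `end_ms`) are assumptions made up for this example, not Veritone's actual output schema.

```python
import json

# Hypothetical payload; the field names are illustrative, not a real schema.
SAMPLE_OUTPUT = """
{
  "engine": "transcription",
  "words": [
    {"text": "I",      "start_ms": 0,   "end_ms": 150},
    {"text": "want",   "start_ms": 160, "end_ms": 400},
    {"text": "a",      "start_ms": 410, "end_ms": 480},
    {"text": "refund", "start_ms": 490, "end_ms": 950},
    {"text": "today.", "start_ms": 960, "end_ms": 1400}
  ]
}
"""


def find_keyword(payload: str, keyword: str) -> list[int]:
    """Return the start times (ms) of every word matching the keyword.

    Matching ignores case and trailing/leading punctuation.
    """
    data = json.loads(payload)
    target = keyword.lower()
    return [
        word["start_ms"]
        for word in data["words"]
        if word["text"].lower().strip(".,!?") == target
    ]


if __name__ == "__main__":
    print(find_keyword(SAMPLE_OUTPUT, "refund"))
```

Once the words carry timestamps, jumping to the moment in a recording where something was said becomes a simple lookup rather than a listen-through.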

Flashing back to Haystac’s training and the ability of that tool to apply what it learns from one customer to another (in the background, of course), I asked Bazler if aiWARE had a similar capability to learn based on user interactions with it. While customers can build up, train, and save models for the future, that’s within an individual account. Bazler did express interest in the concept though.

Models, Models Everywhere! How Do You Choose?

With so many models to choose from on the platform, the choice can be bewildering to companies new to AI. Veritone is developing an offering that “allows you to compare the performance and cost and bias and other behavioral variables between models in a very straightforward way. Essentially, it will help you decide which model to use initially based on your unique data sets.”
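For a sense of what comparing models on your own data can involve, here is a small Python sketch that scores two transcription outputs against a human-checked reference using word error rate (WER), a common accuracy metric for speech recognition. This is not Veritone's comparison offering; the engine names and sample transcripts are invented, and a real evaluation would also weigh cost, bias, and other variables.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical data: one human-checked reference and two engine outputs.
REFERENCE = "the claim was filed on monday"
OUTPUTS = {
    "engine-a": "the claim was filed on monday",
    "engine-b": "the claim was filled monday",
}

SCORES = {name: word_error_rate(REFERENCE, text) for name, text in OUTPUTS.items()}
BEST = min(SCORES, key=SCORES.get)  # lowest error rate wins

if __name__ == "__main__":
    print(SCORES, BEST)
```

Running the same scoring over a representative sample of your own recordings, rather than a vendor's benchmark, is what makes the comparison meaningful for your data.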

Over time, models can become less or more effective. Thanks to Veritone’s VTN Standard, switching to a different model is simple – select it from a drop-down menu in the Automate Studio product. Veritone is also developing a marketplace of different AI models and workflows. Currently, Automate Studio allows customers to “stitch together” workflows and engines with a drag-and-drop workflow design.

How This All Works for the Customer

I asked Bazler for an example of how a customer is currently using the aiWARE platform. He shared how the Pacific Northwest Metropolitan Police Department uses it to protect citizen privacy. The Department regularly has to redact thousands of hours of audio/video evidence.

A workflow engine ingests the evidence and then “through our facial recognition engines, redacts not only the faces, but what those ‘faces’ were saying as they were on-screen.” This is an example of multiple engines working together to solve a complex business problem. With the redaction application, Veritone also keeps the chain of custody logs and provides redacted content to the department.

Over the course of a year, this adds up to a million dollars of savings simply by eliminating the manual, frame-by-frame redaction process needed before.

Because of his roots in the imaging industry, Bazler sees a number of potential use cases “expanding the document capture value proposition where documents, audio, and video intersect.”

Take banking and financial services, obviously a large document capture market. Says Bazler, “you can apply AI to customer conversations over any channel to analyze and route those conversations to the right person for immediate handling.”

For these highly regulated industries, there’s a compliance play. Currently, maybe one out of 100 calls is sampled for QA to check for problematic language. With AI, you go from checking 1% of calls to checking 100% of them for compliance, by identifying keywords that are an important part of the business process.
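Once every call has a transcript, the keyword check itself is simple. The Python sketch below scans every transcript for flagged phrases instead of a 1% sample. The phrases, call IDs, and transcripts are invented for illustration, and whole-phrase matching is only the most basic approach; it says nothing about how Veritone implements this.

```python
import re

# Hypothetical phrases a compliance team might want to catch.
FLAGGED_PHRASES = ["guaranteed return", "no risk"]


def flag_calls(transcripts: dict[str, str], phrases: list[str]) -> dict[str, list[str]]:
    """Return {call_id: [phrases found]} for every call containing a flagged phrase."""
    patterns = {
        phrase: re.compile(r"\b" + re.escape(phrase) + r"\b") for phrase in phrases
    }
    flagged: dict[str, list[str]] = {}
    for call_id, text in transcripts.items():
        # Lowercase and collapse whitespace so spacing and case do not hide a match.
        normalized = re.sub(r"\s+", " ", text.lower())
        hits = [phrase for phrase, pattern in patterns.items() if pattern.search(normalized)]
        if hits:
            flagged[call_id] = hits
    return flagged


# Hypothetical transcripts keyed by call ID.
CALLS = {
    "call-001": "Thanks for calling. This fund has a guaranteed return.",
    "call-002": "I can email you the prospectus today.",
    "call-003": "There is NO   risk at all with this product.",
}

FLAGGED = flag_calls(CALLS, FLAGGED_PHRASES)

if __name__ == "__main__":
    coverage = len(CALLS) / len(CALLS)  # every call is checked, not a sample
    print(f"checked {coverage:.0%} of calls; flagged: {FLAGGED}")
```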

In insurance, AI can assist with fraud detection. One way is analyzing phone calls that are part of the first notice of loss process (someone stole my car, for example). All of the data (like phone calls and images) can be searched for key words and dates to confirm that dates, times, and locations all match between parties.

Also, think of the treasure trove of video content on YouTube and TikTok. Brands could use aiWARE to recognize and locate a brand’s logo or product in a video. Or they could take the social media sentiment analysis that is currently done on the text in tweets and apply it to even more content. A company could do this now, but some poor schmuck is going to have to wade through and evaluate massive volumes of content and do a lot of right-click, save to manually download items of interest. Even so, as a sometimes marketing guy, that’s an exciting idea.

As transcriptions get closer to being possible in real-time, even more opportunities open up. Picture AI agents mining call transcripts for sentiment analysis – unhappy, happy, frustrated, whatever – and providing feedback and options to the customer via a predetermined workflow. Over time, the line between an AI agent and a real agent can blur as an AI agent can mimic a person’s voice and only bring a human into the workflow when needed. The AI agent will always make the right choice based on keywords (and phrases). As a bonus, an AI agent won’t say anything off-script that could expose a company to any sort of risk. This, however, is not quite there yet.

While not directly related to DIR coverage, Veritone does have a voice solution called Marvel.AI. It provides synthetic voice generation and voice cloning based on text-to-speech or speech-to-speech technology. Theoretically, Morgan Freeman or Sam Elliott (who, in my opinion, should share all voiceover duties, everywhere) could have their voices synthesized and farmed out. They wouldn’t have to be present to record. Again, not relevant for our coverage, but interesting nonetheless.

I can’t help but think of all the search vendors I briefed with back in the mid to late-90s who were attempting to do what Veritone can do with video and voice now. My conversation with Seth Earley from last year is also relevant, as he pointed out that the current state of AI is, in some applications, simply (well, not so simply) better search.

The “intelligence” in AI depends on the ability to find keywords and phrases and have the engine determine what it “means.” Just keep in mind that it’s still just a computer crunching 1s and 0s; it’s not true intelligence yet.

To help put some of this in context, Richard Medina of Doculabs and Bazler co-presented a webinar earlier this year, Capture 100% of Your Content – Applying AI at the Intersection of Documents and Media. It does a good job of putting these ideas into context.