
AI systems, applications, and workflows run on data. But not all data is created equal, and not all organizations are ready to harness it. Traditional Digital Asset Management (DAM) and Media Asset Management (MAM) systems were designed to catalog structured files like images, marketing collateral, and finished videos. While these systems remain useful, they fall short in today’s environment, where unstructured data contained within video, audio, PDFs, emails, and other digital media makes up an estimated 80-90% of the world’s data.

The scale and complexity of this unstructured content have created a new operational challenge. How do you find, govern, and extract value from massive libraries of data that can no longer be managed manually? AI has helped solve that challenge, but it all starts with data. 

The consequences of poor data asset management are significant, and they go beyond the traditional DAM and MAM concerns of content visibility and discoverability. This is about making data usable for AI: no AI system or machine learning model can effectively understand content it was never built to interpret.

This is why organizations are moving beyond managing the assets themselves and embracing the management of the structured and unstructured data within those files, transforming them into strategic assets across their entire lifecycle.

Content as data: the strategic mindset shift

To fully unlock the value of unstructured content, businesses must adopt a new way of thinking: content is data. And for an AI model, the smallest piece of data it works with is called a token.

A token represents the smallest piece of text an AI model works with when reading, interpreting, or generating language. A token might be a whole word, a fragment of a word, a punctuation symbol, or even a blank space. Through a process known as tokenization, text is split into these smaller components. By analyzing how tokens relate to one another, the AI model can understand context, build meaning, and produce responses—though always within the boundaries of its maximum token capacity.
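To make tokenization concrete, here is a minimal sketch using the open-source tiktoken library; the tokenizer choice is purely illustrative, since every model family splits text in its own way.

```python
# A minimal tokenization sketch, assuming the open-source tiktoken
# library (one BPE tokenizer among many) purely for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

text = "Content is data."
token_ids = enc.encode(text)  # the sentence as a list of integer token ids
print(token_ids)

# Decode each id individually to see how the sentence was split:
# some tokens are whole words, others fragments or punctuation.
for tid in token_ids:
    print(tid, repr(enc.decode([tid])))
```

Running this on longer text makes the token budget tangible: the length of token_ids, not the character count, is what counts against a model’s context window.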

This is not just “better tagging.” It’s a fundamental rethinking of how organizations perceive their data, the state that data is currently in, and their ability to use AI effectively. And the only way to operationalize this mindset is to prepare the data so that it becomes usable by whatever AI system they are trying to leverage.

But preparing data as tokens is only the first step. To make that data truly usable across different applications and contexts, AI systems need a way to extend their understanding beyond their initial training. 

Achieving extensibility with AI systems

AI extensibility is essentially the ability for systems to access and leverage external data beyond what they were originally trained on. In practice, this means making text, images, audio, and other forms of information available to an AI in a format it can “read” and understand. 

At its core, an AI system is mathematical; it doesn’t inherently recognize a word like apple or interpret the pixels in a photograph. Vector embeddings bridge this gap by converting any kind of data, such as text, images, and audio, into a list of numbers (a vector) that captures both meaning and context.

For instance, in a text embedding, the vector for king might appear close to queen but far from car. This proximity reflects semantic similarity in a multi-dimensional space. Put simply, embeddings convert raw data into a structured, machine-readable format that preserves meaning and relationships. 
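Here is what that proximity looks like in practice, in a minimal sketch that assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; both are illustrative stand-ins for whichever embedding model an organization actually uses.

```python
# A minimal sketch of the king/queen/car intuition, assuming the
# sentence-transformers library and an off-the-shelf embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["king", "queen", "car"])  # one vector per word

# Cosine similarity: values closer to 1.0 mean closer in the embedding space.
print("king vs queen:", util.cos_sim(vectors[0], vectors[1]).item())
print("king vs car:  ", util.cos_sim(vectors[0], vectors[2]).item())
```

The exact numbers depend on the model, but king/queen should consistently score higher than king/car.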

Beyond text, multimodal embeddings extend this idea by placing different data types—such as images, audio, and text—into the same space. For example, the embedding vector of a picture of a dog is positioned closer to the text embeddings of dog or husky than to car. This shared space makes it possible to retrieve images based on a text prompt, cluster data into categories, identify outliers, or build specialized models like image classifiers or speech recognizers. Such embeddings are often trained through contrastive learning, where models learn similarities by comparing matching and non-matching pairs.
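As a sketch of that shared space, the example below scores a photo against a few text labels using OpenAI’s publicly released CLIP model via the Hugging Face transformers library; the model checkpoint and the image path are assumptions for illustration.

```python
# A minimal multimodal sketch: CLIP places images and text in one
# embedding space. The checkpoint and image path are illustrative.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # hypothetical local photo of a dog
inputs = processor(
    text=["a dog", "a husky", "a car"],
    images=image,
    return_tensors="pt",
    padding=True,
)

# logits_per_image holds image-text similarity scores in the shared
# space; softmax turns them into relative probabilities per label.
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))
```

For a dog photo, the probability mass should land on “a dog” and “a husky” rather than “a car”, which is exactly the behavior that makes text-based image retrieval possible.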

A universal language for AI

By turning all data into vectors, AI systems gain a universal “language” for processing diverse information. This enables a range of advanced capabilities, such as:

  • Semantic search: going beyond keyword matching, AI can retrieve documents, images, or clips that are conceptually related to a query (a minimal sketch follows this list).
  • Recommendation systems: by comparing user preference vectors with content vectors, AI delivers highly personalized suggestions.
  • Data integration: disparate sources, such as documents, media files, and structured datasets, can all be represented in the same format and stored in a vector database for unified access.
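As referenced in the first bullet above, here is a toy version of semantic search: a handful of documents embedded into an in-memory “vector store” and ranked against a query. The library and model are the same illustrative assumptions as in the earlier sketches; a production system would use a vector database, but the ranking logic is identical.

```python
# A toy semantic search over an in-memory "vector store".
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Quarterly earnings call recording",
    "Highlights from the championship game",
    "Security camera footage, loading dock",
]
# Normalized vectors make the dot product equal to cosine similarity.
doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(["sports video"], normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec
for idx in np.argsort(-scores):  # best match first
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

Notice that “sports video” shares no keywords with “Highlights from the championship game”, yet it should rank first; that is the difference between semantic and keyword matching.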

Vector embeddings are the foundation for an AI system’s ability to grow without retraining the entire model. Instead of rebuilding from scratch, where every new dataset would require retraining a large language model (LLM) or large multimodal model (LMM) from the ground up, organizations can use techniques like Retrieval-Augmented Generation (RAG). Because LLMs are limited by their context window (the number of tokens they can process at once), RAG uses semantic search over vector embeddings to surface only the most relevant information for a given query. 

This is often combined with keyword search for added robustness, especially when dealing with rare, domain-specific terms or acronyms. When another LLM orchestrates these searches, formulating queries and sometimes performing multiple iterations, the process is known as agentic RAG. Together, these methods make systems more adaptive, scalable, and future-proof, enabling them to evolve alongside new business needs and data streams.
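To tie these pieces together, here is a schematic of the retrieval half of RAG, reusing the same illustrative embedding setup as above. The prompt assembly is deliberately simplified and the final LLM call is omitted; an agentic RAG system would wrap this retrieval step in a loop so the orchestrating LLM can reformulate the query and retrieve again.

```python
# A schematic RAG retrieval step; the embedding model is the same
# illustrative assumption as in the earlier sketches.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Semantic search: return the k documents closest to the query."""
    doc_vecs = model.encode(documents, normalize_embeddings=True)
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(doc_vecs @ q_vec))[:k]
    return [documents[i] for i in top]

def build_prompt(query: str, documents: list[str]) -> str:
    # Only the top-k retrieved chunks enter the prompt, which keeps it
    # inside the LLM's context window no matter how large the corpus is.
    # In a full system this prompt would then be sent to an LLM.
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```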

The role of AI in data transformation

The challenge of managing unstructured data isn’t just about organization—it’s about scale. Just a few months ago, Veritone reached a major milestone when the Veritone Data Refinery (VDR), powered by aiWARE, processed more than five trillion tokens across millions of hours of video and audio. This achievement demonstrates what’s possible when enterprise AI systems are designed for both performance and extensibility.

The vast majority of enterprise data is locked away in formats like broadcast archives, sports footage, marketing content, and security recordings. VDR changes that by transforming this raw content into structured, tokenized data enriched with governance and rights metadata. Once refined, this data can be searched, monetized, or used to train secure, proprietary AI models, all without retraining the entire system.

VDR draws on aiWARE’s ecosystem of more than 850 specialized AI models, which work in concert to ingest, analyze, and enrich multimodal content. This combination empowers enterprises to:

  • Accelerate discovery with semantic search and metadata enrichment.
  • Automate workflows from compliance redaction to content routing.
  • Enable monetization by converting dormant archives into licensable, revenue-generating datasets.
  • Scale securely with built-in governance and compliance.

What was once a differentiator is now the baseline. The companies that will lead are those that embed AI-powered data asset management deeply into their workflows, creating intelligent systems that continuously generate insights and value without requiring constant retraining.

What’s next: diving into future topics

In the next part of this series, we’ll unpack the key pillars of modern data asset management in greater detail, starting with ingestion, where organizations first capture, structure, and secure their unstructured data.

Learn More About Veritone Data Refinery

Sources: 

https://researchworld.com/articles/possibilities-and-limitations-of-unstructured-data

 

Meet the author.


Veritone

Veritone (NASDAQ: VERI) builds human-centered AI solutions. Veritone’s software and services empower individuals at many of the world’s largest and most recognizable brands to run more efficiently, accelerate decision making and increase profitability.
