Mar 26, 2026
Why Companies Spend Millions on Data They Already Own
There’s a moment that happens inside almost every data-driven company trying to adopt AI. A team decides it needs better data to train models. They open up procurement, start evaluating vendors, and within a few weeks they are paying for access to datasets through marketplaces: credit card transactions, satellite imagery, sentiment feeds, supply chain signals. If they are already using Snowflake, the path is even easier. The Snowflake Data Marketplace makes it feel like progress: click, subscribe, query.
It looks like momentum. It feels like sophistication. It is almost always the wrong starting point.
Because at the exact same time, in the same company, there are years, often decades, of proprietary data sitting in backups, untouched. Not because it lacks value, but because the systems that store it were never designed to make it usable.
The System Everyone Already Has
If you walk into a typical company building AI today, the stack looks something like this:
Product and event data flowing into Snowflake or Databricks
Experiments and models tracked in MLflow or Weights & Biases
Internal tools built on top of S3, dashboards, and Python notebooks
And then, sitting entirely outside of that loop:
Backups in Veeam, Rubrik, or Cohesity
Email archives, historical databases, logs, documents, internal communications
Years of decisions, edge cases, failures, and outcomes
These two systems rarely connect. One is optimized for querying clean, recent data. The other is optimized for compliance and recovery.
So when an AI team asks, “What data should we train on?”, they look at what is accessible, not what is valuable.
That is how companies end up buying data they already have.
A Concrete Example That Feels Familiar
Take a mid-sized SaaS company trying to improve its customer support automation.
The AI team wants to build a model that can predict escalation risk and suggest responses. They start with what is easy:
Export recent tickets from Zendesk
Join with product usage data in Snowflake
Maybe augment with external datasets on customer behavior or sentiment
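The “easy path” above amounts to a join between a ticket export and a usage table. A minimal sketch in Python, using small in-memory stand-ins for the real exports (the column names and values here are hypothetical, and pandas is assumed to be available):

```python
import pandas as pd

# Stand-ins for the two exports; in practice these would come from the
# Zendesk API and a Snowflake query rather than inline literals.
tickets = pd.DataFrame({
    "ticket_id": [101, 102, 103],
    "account_id": ["a1", "a2", "a1"],
    "escalated": [True, False, False],
})
usage = pd.DataFrame({
    "account_id": ["a1", "a2"],
    "weekly_active_users": [42, 7],
})

# Join recent tickets with product usage on the shared account key.
training_frame = tickets.merge(usage, on="account_id", how="left")
print(training_frame.shape)  # one row per ticket, enriched with usage columns
```

This is exactly the kind of dataset that works “only to a point”: it captures the latest state of each ticket, but none of the history behind it.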
It works, but only to a point. The model struggles with edge cases. It misses patterns that experienced support reps intuitively understand.
What’s missing is not more data from outside. It is depth from inside.
Because somewhere else in the company:
There are 5–10 years of historical tickets, including resolved escalations
Internal Slack threads where engineers diagnosed root causes
Emails between account managers and customers during critical incidents
Product logs that show exactly what happened before each issue
All of this exists. Most of it is already being backed up every day.
None of it is being used.
The Structural Problem
This is not a tooling gap. It is a structural mismatch.
Data marketplaces and warehouses are built around snapshots:
Clean tables
Defined schemas
Recent slices of data
Backups are built around history:
Full timelines
Raw, unstructured context
Every version of what happened, not just the latest state
Modern AI systems need the second far more than the first.
They need to understand sequences, decisions, failures, and outcomes over time. They need to train on how the business actually operated, not just how it is currently modeled in a warehouse.
But accessing backup data today usually means:
Filing a request with IT
Waiting for a restore
Dealing with proprietary formats
Writing one-off scripts to extract anything usable
So teams don’t do it. They default to what is already queryable, even if it is incomplete.
What Changes When Backups Become Infrastructure
The shift is not about adding AI features on top of backups. It is about changing what backups are in the system.
Duplicati treats backups as a continuous training data layer, not a cold archive.
Instead of requiring manual extraction, it does three things:
Structures historical data automatically. Backup data is continuously exported into formats like Parquet or Delta Lake, stored in S3-compatible systems, and made compatible with existing tools.
Preserves context and time. Rather than flattening everything into tables, it keeps the sequence of events intact: what happened, when, and in what order.
Integrates directly into ML workflows. The same pipelines that pull from Snowflake or Databricks can now pull from backup-derived datasets, with experiment tracking in MLflow or Weights & Biases.
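As a rough sketch of what a sequence-preserving, backup-derived dataset might look like in Python (the event records and column names are hypothetical, the S3 path in the comment is illustrative, and pandas is assumed):

```python
import pandas as pd

# Hypothetical events recovered from backups; a real pipeline would read
# them from Parquet or Delta files in S3-compatible storage, e.g.:
#   events = pd.read_parquet("s3://backup-derived/support_events/")
events = pd.DataFrame({
    "ticket_id": [7, 7, 7],
    "event": ["opened", "engineer_diagnosis", "resolved"],
    "ts": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-05"]),
})

# Keep the sequence intact: order events within each ticket by time
# instead of collapsing everything to a latest-state row.
timeline = events.sort_values(["ticket_id", "ts"])
sequence = timeline.groupby("ticket_id")["event"].apply(list)
print(sequence.loc[7])  # ['opened', 'engineer_diagnosis', 'resolved']
```

The resulting per-ticket timelines can then feed the same training pipelines that already pull from the warehouse, with runs logged in MLflow or Weights & Biases as usual.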
This means the support model example changes in a very practical way.
Instead of training on six months of cleaned ticket data, the team can train on:
Ten years of support interactions
Every escalation path and resolution
Internal discussions that led to fixes
The full lifecycle of customer issues
The model stops being a thin layer on top of recent data. It becomes a reflection of how the company has actually learned over time.
Why This Replaces Data Marketplaces
Data marketplaces are not inherently bad. They are useful when you genuinely lack data.
But most companies are not data-poor. They are access-poor.
When you unlock internal history:
You reduce reliance on generic external datasets
You train on signals that are unique to your business
You eliminate the lag between data generation and model training
The economic impact is straightforward.
Companies often spend hundreds of thousands, sometimes millions, on external data subscriptions while also paying to store their own data for compliance. If even a fraction of that external spend can be replaced with internal data, the savings are immediate.
More importantly, the models improve.
Because no external dataset can replicate your exact customers, your exact operations, or your exact decisions over time.
The Real Shift
The deeper change is not financial. It is conceptual.
Most companies think of their data stack as:
Warehouses for analytics
Marketplaces for enrichment
Backups for disaster recovery
That model made sense before AI.
In an AI-driven system, the most valuable dataset is not what you can query today. It is the full record of everything that has ever happened inside your company.
Backups are the only system that already contains that.
Duplicati’s thesis is simple: the companies that win in AI will not be the ones that buy the most data. They will be the ones that can actually use their own.
And for most of them, that data is already there, waiting in a system they have been treating as a cost center instead of infrastructure.