Article
Mar 26, 2026
Backup Companies Optimized for Disaster Recovery Will Lose the AI Era
There is a moment that happens inside almost every enterprise once it seriously tries to build AI. It usually starts in a place that feels modern. A data team is working in Snowflake or Databricks. Models are being prototyped in Python using Pandas, PyTorch, or XGBoost. Product teams are piping logs into S3. Maybe a vector search layer is even starting to take shape, on Pinecone or on Postgres with pgvector. On the surface, it looks like a company that is becoming “AI-native.”
And then someone asks a simple question:
Can we train this on everything we already know as a company?
Not just the last six months of clean tables. Not just the curated datasets sitting in the warehouse. Everything. The decisions, the edge cases, the failures, the customer interactions, the system states, the things that actually define how the business operates.
That is where things break.
Because all of that history exists. It is just not where anyone building AI can reach it.
It is sitting inside systems like Rubrik, Cohesity, and Veeam.
And those systems were never designed to be used this way.
The System Everyone Has, But No One Uses
If you walk into almost any mid-sized or large company today, the architecture looks roughly the same.
Operational systems generate data. Product logs, application databases, internal tools, emails, file systems. That data flows into analytics layers like Snowflake or Databricks where teams can query, transform, and build dashboards.
Separately, everything is backed up.
That backup layer is comprehensive. It contains years of history. Full system snapshots. Versions of files that no longer exist anywhere else. Records of how things actually changed over time, not just their current state.
It is also completely disconnected.
The analytics stack is optimized for speed and structured queries. The backup stack is optimized for recovery. If something breaks, you restore it. If something is deleted, you recover it. That is the entire design philosophy.
So when an AI team tries to access real historical data, they do not start with backups. They start building pipelines. They instrument new logging. They buy external datasets. They reconstruct fragments of history in places like S3.
Meanwhile, the most complete version of the company’s own data is sitting in a system that is treated like a vault you are only allowed to open during emergencies.
This is not a tooling gap. It is a structural one.
Backup systems were never meant to be part of the feedback loop.
Why Incumbents Cannot Bridge This Gap
Companies like Rubrik, Cohesity, and Veeam are not behind because they lack engineering talent. They are behind because their entire product philosophy is anchored to a different problem.
They are designed around disaster recovery.
Everything flows from that constraint. Data is stored in proprietary formats optimized for durability. Access is controlled, often slow, sometimes intentionally painful to prevent misuse. The primary interface is not a query layer or an API for experimentation, but a restore workflow.
Even when these companies add AI features, the direction is predictable. You get better search across backups. You get anomaly detection. You get compliance automation. You might even get conversational interfaces that let you ask questions about what is stored.
But the core assumption never changes.
The data is something you retrieve when needed, not something you continuously work with.
AI does not work that way.
Modern AI systems require iteration. They require being able to run experiments, retrain models, compare outputs across time, and simulate decisions under different historical conditions. They require datasets that are not just snapshots, but timelines.
A system built to restore a file is fundamentally different from a system built to replay how that file evolved, why it changed, and what decisions were made around it.
That is why adding AI features to a recovery-first platform does not solve the problem. It only makes the archive slightly more searchable.
What It Looks Like When You Actually Try to Use This Data
A SaaS company wants to build a model that predicts churn more accurately. Today, they might pull data from their warehouse, join product usage metrics together, maybe enrich them with support ticket data, and train a model in a notebook.
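A minimal sketch of that notebook workflow, using the Pandas and XGBoost stack mentioned earlier. The file names, columns, and labels here (usage_metrics.csv, support_tickets.csv, account_id, is_churned) are illustrative placeholders, not any particular company’s schema:

```python
# Hypothetical churn notebook: join curated warehouse extracts, train a model.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Curated exports from the warehouse: one row per account.
usage = pd.read_csv("usage_metrics.csv")      # e.g. logins_30d, feature_x_uses
tickets = pd.read_csv("support_tickets.csv")  # e.g. open_tickets, avg_resolution_hours

df = usage.merge(tickets, on="account_id", how="left").fillna(0)

X = df.drop(columns=["account_id", "is_churned"])
y = df["is_churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)

print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```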
It works, but it is shallow. It only captures what was intentionally logged and structured.
What they really need is everything that led up to churn. The full sequence of events. Feature usage patterns that were never formalized into metrics. Historical versions of accounts. Internal conversations about customer issues. Edge cases that never made it into dashboards.
All of that exists.
It is in backups of their application databases. Their support systems. Their internal tools. Their file storage. It is the closest thing the company has to a complete memory of its own behavior.
But extracting that data today is a project. It involves one-off scripts, manual approvals, partial restores, and a lot of engineering effort just to get something usable into S3 or a data lake.
So most teams do not do it.
They settle for the subset of data that is easy to access, and then they try to compensate by buying external data or over-engineering features.
The result is predictable. Models that look sophisticated but miss the actual dynamics of how the business operates.
The Shift: From Storage to Activation
The core idea behind Duplicati is simple, but it cuts directly against how backup systems have been built for the last twenty years.
Backups are not just insurance. They are the most complete dataset a company owns.
The problem is not that the data is missing. It is that it is not usable.
Duplicati treats the backup layer as a starting point for AI, not the end of a storage pipeline. Instead of keeping data locked in formats designed only for recovery, it continuously transforms it into something that can be worked with.
That means exporting data into formats like Parquet or Delta Lake on S3-compatible storage. It means indexing and versioning it so you can understand not just what exists, but how it changed. It means making it queryable, vectorizable, and compatible with the tools teams are already using, whether that is MLflow for experiment tracking or standard Python workflows for modeling.
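As an illustration, here is a hedged sketch of what consuming such an export could look like, not Duplicati’s actual API. It assumes account snapshots have already been exported as date-partitioned Parquet under a hypothetical s3://backup-exports/accounts/ prefix, and reads them with standard Python tooling:

```python
# Hedged sketch: query versioned backup exports with ordinary Python tooling.
# Assumes a hypothetical layout such as
#   s3://backup-exports/accounts/snapshot_date=2023-01-01/part-0.parquet
# Bucket, prefix, and column names are illustrative.
import pandas as pd

# Requires pyarrow and s3fs; reads every snapshot_date partition into one frame.
history = pd.read_parquet("s3://backup-exports/accounts/")
history = history[["account_id", "plan", "seats", "mrr", "snapshot_date"]]

# Not just what exists now, but how each account changed over time.
history = history.sort_values(["account_id", "snapshot_date"])
plan_changes = history.groupby("account_id")["plan"].nunique().rename("plan_changes")

print(plan_changes.describe())
```

From there, the same frame can feed an MLflow-tracked experiment or any ordinary modeling workflow; nothing about it is specific to the backup vendor’s tooling.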
Now, when a team wants to build a churn model, they are not starting from a thin slice of recent data. They are starting from the company’s full operational history.
They can train on real sequences of events. They can simulate alternative decisions. They can see how similar situations played out years ago and how the outcomes differed.
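To make “training on real sequences of events” concrete, here is a continuation of the sketch above, still with illustrative column names: each account’s timeline of snapshots is condensed into trajectory features instead of a single current-state row.

```python
# Sketch continued: turn per-account timelines into trajectory features.
# Columns (mrr, seats, snapshot_date) remain illustrative placeholders.
import pandas as pd

def trajectory_features(history: pd.DataFrame) -> pd.DataFrame:
    """One row per account, summarizing how it changed rather than where it is."""
    history = history.sort_values(["account_id", "snapshot_date"])
    grouped = history.groupby("account_id")
    return pd.DataFrame({
        "mrr_start": grouped["mrr"].first(),
        "mrr_end": grouped["mrr"].last(),
        "mrr_trend": grouped["mrr"].last() - grouped["mrr"].first(),
        "seat_volatility": grouped["seats"].std().fillna(0),
        "snapshots_observed": grouped["snapshot_date"].nunique(),
    }).reset_index()

features = trajectory_features(history)
# These rows can replace or augment the warehouse-only features from the
# earlier churn notebook, so the model sees the trajectory, not a snapshot.
```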
The backup stops being a dead archive and becomes a continuously updated training layer.
Why This Changes the Economics
There is a second-order effect that matters just as much as the technical one.
Right now, companies pay for their data twice.
They pay to store it for compliance and recovery. Then they pay again to acquire or engineer datasets for AI. In many cases, the second spend is larger than the first.
A mid-sized firm might spend hundreds of thousands of dollars a year on external data and data engineering, while its own historical data remains largely unused.
When you make backups usable, that equation changes.
You are not just reducing costs. You are changing where value comes from. Instead of relying on generic datasets that everyone else can access, you are training on the one thing your competitors do not have: your own history.
That is where differentiation actually comes from.
The Companies That Win This Transition
This is not a feature race. It is not about who adds the best AI assistant to their product.
It is about which systems become part of the core AI workflow.
Backup companies that remain anchored to disaster recovery will continue to exist, but they will sit outside the loop where models are trained and decisions are made. They will store the past, but they will not shape the future.
The systems that win are the ones that turn historical data into something that can be iterated on continuously. The ones that let teams move from “what do we have stored?” to “what can we learn from everything we have ever done?”
Duplicati is built around that shift.
Not as backup software with AI features, but as infrastructure that makes a company’s own history usable for AI in the first place.
Because in the end, the most valuable dataset is not something you buy.
It is everything you have already lived through, if you can finally use it.



