Article
Mar 26, 2026
Factories Have 20 Years of Failure Data. None of It Trains Their AI Models
Walk into a modern factory and the story sounds familiar. There is no shortage of data. Machines stream logs into historians. Maintenance teams record work orders in systems like SAP PM or IBM Maximo. Sensors feed dashboards built on tools like Ignition, OSIsoft PI, or even custom pipelines into Snowflake or Databricks. When something breaks, there is usually a record somewhere, and often several.
And yet, when a manufacturing team decides to invest in predictive maintenance or AI-driven optimization, they almost always start from scratch.
They buy an external dataset. They onboard a vendor. They install a new IoT platform. They stand up a parallel pipeline that looks nothing like the systems already running their plant.
The disconnect is not a lack of data. It is that the most valuable data they already have is locked in a place no one thinks to use: backups.
The reality inside a factory
Consider a mid-sized manufacturer running multiple production lines. Over the past decade, they have accumulated:
Machine logs from PLCs and SCADA systems
Maintenance records from ERP systems
Incident reports stored in SharePoint or internal tools
Technician notes, emails, and shift summaries
Historical configuration changes and firmware updates
Most of this data is already being backed up continuously. It exists in full fidelity. It spans years of operations, including every failure, near-miss, and repair.
But if you ask the team building AI models where their training data comes from, the answer is almost always something else.
They export a subset of recent data into a clean table. They normalize it manually. They discard anything messy or unstructured. Then they augment it with vendor datasets or synthetic signals because the internal data is too hard to access and too inconsistent to use.
So the model ends up learning from a simplified, partial version of reality, while the complete operational history sits untouched. This is not a tooling gap. It is a structural one.
Backup systems were designed for recovery, not for iteration. They are optimized for restoring a server, not reconstructing the context around why a machine failed three years ago under a specific combination of load, temperature, and operator behavior.
Why current AI approaches fall short
Most predictive maintenance vendors solve this by introducing entirely new infrastructure. They install sensors. They build a clean data pipeline. They run models on top of that pipeline. Over time, they promise, the predictions improve.
But this approach has two problems.
First, it ignores the past. A factory might have decades of failure data, but the model only sees what has been collected since the new system was installed. That means relearning patterns the organization has already experienced, often at significant cost.
Second, it fragments the system. Now there are two sources of truth: the operational systems that actually run the factory, and the AI system that tries to interpret it. Keeping them in sync becomes its own engineering problem.
The result is that AI remains an add-on rather than becoming part of how the factory actually operates.
A different starting point: treat backups as the dataset
Duplicati takes a different view.
Instead of asking how to build a new data pipeline for AI, it starts with the data that already exists in backup systems and treats that as the foundation.
This is the same core idea outlined in the broader shift toward AI-native infrastructure, where the most valuable training data is not something you buy but something you already own and have been collecting for years.
In a manufacturing context, this means taking the full historical record—machine logs, maintenance events, communications, and system states—and making it usable for model training without breaking compliance or operational constraints.
What this looks like in practice
Imagine a reliability engineer trying to answer a simple question:
“Have we seen this failure mode before, and what actually fixed it?”
Today, that answer requires digging through multiple systems. Logs in one place, maintenance tickets in another, emails somewhere else. Even if the data exists, connecting it is slow and manual.
With Duplicati, the workflow changes.
Backup data is continuously extracted and structured into formats that can be queried and analyzed. Logs are aligned with maintenance records. Unstructured notes are indexed. Time becomes a first-class dimension, so you can reconstruct what was happening across the system at any given moment.
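As a rough illustration, assuming the backup contents have already been extracted into flat tables, the sketch below aligns machine log readings with the most recent maintenance event on the same machine using a time-ordered join. The file names and columns (machine_logs.csv, maintenance_records.csv, timestamp, machine_id) are hypothetical, not part of Duplicati itself; the point is only that once time is the join key, sensor state and repair history can sit in the same frame.

```python
# Illustrative sketch only: file names, columns, and schema are assumptions,
# not a Duplicati feature. It shows one way to treat time as a first-class
# dimension by attaching the most recent maintenance event to each log reading.
import pandas as pd

# Assume backup snapshots have already been extracted into flat tables
# with a timestamp and a machine identifier.
logs = pd.read_csv("machine_logs.csv", parse_dates=["timestamp"])
maintenance = pd.read_csv("maintenance_records.csv", parse_dates=["timestamp"])

# merge_asof requires both frames to be sorted on the join key.
logs = logs.sort_values("timestamp")
maintenance = maintenance.sort_values("timestamp")

# For each log reading, pull in the most recent maintenance event on the
# same machine, so every row carries both sensor state and repair context.
aligned = pd.merge_asof(
    logs,
    maintenance,
    on="timestamp",
    by="machine_id",
    direction="backward",
    suffixes=("", "_maintenance"),
)

print(aligned.head())
```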
Instead of pulling a small dataset into a model, the model can train directly on the full operational history.
This allows for use cases that are difficult or impossible with current approaches:
Training a model on every historical failure across all machines, not just recent sensor data
Identifying patterns that only appear over long time horizons, such as seasonal wear or rare edge cases
Reconstructing the exact sequence of events leading to a failure, including human decisions and system changes (a rough sketch of this follows the list)
Comparing how similar issues were resolved across different plants or teams
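For the event-sequence use case above, a minimal sketch is shown below, assuming logs, technician notes, and configuration changes have each been extracted into tables with a timestamp, a machine identifier, and a free-text description. All of the specifics (the file names, the press_07 machine, the failure timestamp) are illustrative assumptions, not real data or a Duplicati API.

```python
# Illustrative sketch only: the files, machine id, and failure time are
# assumptions for the example. It merges heterogeneous event sources into a
# single ordered timeline and replays the window leading up to a failure.
import pandas as pd

sources = {
    "machine_log": "machine_logs.csv",          # PLC/SCADA log lines
    "technician_note": "technician_notes.csv",  # free-text shift notes
    "config_change": "config_changes.csv",      # firmware and config updates
}

frames = []
for source_name, path in sources.items():
    df = pd.read_csv(path, parse_dates=["timestamp"])
    # Reduce every source to a common shape: when, where, and what.
    df = df[["timestamp", "machine_id", "description"]]
    df["source"] = source_name
    frames.append(df)

timeline = pd.concat(frames).sort_values("timestamp")

# Replay the 48 hours before a known failure on a single machine.
failure_time = pd.Timestamp("2021-07-14 03:20:00")
window = timeline[
    (timeline["machine_id"] == "press_07")
    & (timeline["timestamp"].between(failure_time - pd.Timedelta(hours=48), failure_time))
]

for _, event in window.iterrows():
    print(event["timestamp"], event["source"], event["description"])
```

Once events from different systems share a timeline like this, the "human decisions and system changes" around a failure stop being scattered context and become rows a model or an engineer can read in order.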
The key shift is that the model is no longer learning from a curated dataset. It is learning from reality.
Replacing the need for new infrastructure
Once this layer exists, many of the reasons companies buy separate predictive maintenance platforms start to weaken.
Instead of paying for external IoT datasets, the organization can use its own historical signals, which are often richer and more relevant.
Instead of building custom pipelines to move data into a data warehouse, the pipeline is already implicit in the backup system, which contains a complete, versioned record of the business.
Instead of maintaining parallel systems for operations and analytics, both can operate on the same underlying data.
This does not eliminate the need for tools like Snowflake or Databricks, but it changes their role. They become places where models run and experiments happen, rather than the primary source of truth for data.
The economic shift
Manufacturers already spend heavily on three things:
Maintaining backup and storage for compliance and recovery
Building and maintaining data pipelines
Purchasing external tools and datasets for analytics and AI
What Duplicati does is collapse these into a single layer where the backup system is no longer just a cost center but a source of leverage. The same data that exists for compliance becomes the foundation for optimization, prediction, and automation.
That is a fundamentally different economic model. It turns something every factory already has into something it actively uses.
The broader implication
Manufacturing has always been about learning from experience. Continuous improvement systems, root cause analysis, and lean methodologies all rely on the idea that past failures contain the blueprint for future performance.
The irony is that most of that experience is already captured digitally, but it is trapped in systems that were never designed to be explored.
What changes with an AI-native approach is not just better models. It is the ability to operationalize the full memory of the factory.
When that happens, predictive maintenance is no longer about installing new sensors or buying new software. It becomes about finally using the data the organization has been collecting all along.



