Article

Mar 26, 2026

Your Product Logs Are a Better Dataset Than Anything You Can Buy

If you walk into almost any SaaS company today, the data stack looks deceptively modern. Product teams live inside Mixpanel or Amplitude, watching funnels and retention curves. Support teams operate out of Zendesk or Intercom, handling thousands of tickets that quietly capture every edge case the product fails to explain. Engineering has logs flowing through Datadog, maybe stored in S3, sometimes piped into Snowflake if someone cared enough to model them. And somewhere in the background, backups are running through systems like Duplicati, quietly storing everything that matters and nothing that is actually used.

On paper, this looks like a company that is “data-driven.” In practice, all of these systems are optimized for observation, not learning. The product analytics stack tells you what happened. It does not help you build something that improves on it.

The dataset you already have, but don’t use

Take a simple example: a B2B SaaS company trying to improve onboarding.

The product manager opens Mixpanel and sees a drop-off at step three of the onboarding flow. They can segment users, look at cohorts, maybe run an A/B test. If they want deeper insight, they export data to a notebook, join it with some Snowflake tables, and try to reason about patterns manually.

At the same time, support tickets are piling up in Zendesk. Users are asking questions like “I don’t understand what this step does” or “this integration failed but I don’t know why.” Those conversations are rich with context, but they live in a completely separate system.

Then there are the application logs. Every failed API call, every retry, every timeout is captured in Datadog or CloudWatch. This is the ground truth of what actually happened in the system, but it is rarely connected to user behavior or support conversations in a meaningful way.

Now layer on one more reality: all of this data is already being backed up. Product events, support tickets, logs, internal Slack discussions about bugs and fixes. Years of it.

It is a complete historical record of how the product evolved, how users behaved, where they struggled, and how the team responded.

And yet, when the company decides to “do AI,” none of this becomes the training dataset.

Instead, they look outward. They buy generic datasets, plug into third-party tools, or layer AI on top of Mixpanel dashboards that were never designed for model training in the first place.

Why the current stack breaks the moment you try to build AI

The reason this happens is structural.

Mixpanel and Amplitude are built for aggregation, not reconstruction. You can ask, “what percentage of users dropped off?” but not easily reconstruct the full sequence of events, context, and outcomes needed to train a model.

Zendesk and Intercom contain unstructured, high-signal data, but they are siloed. You can search tickets, but you cannot continuously feed them into a training pipeline that updates as new interactions happen.

Snowflake and Databricks solve storage and querying, but they rely on someone explicitly deciding what data to pipe in, clean, and model. The highest-value data often never makes it there because it is messy, fragmented, or locked in systems that were not designed for analytics.

And backups, which actually contain the most complete version of reality, are treated as cold storage. They are optimized for recovery, not for iteration.

So the company ends up in a strange position: it has all the ingredients to build deeply contextual, high-quality AI systems, but none of the infrastructure to turn that data into something usable.

What changes when you treat backups as training data infrastructure

The shift is not about adding AI features to existing tools. It is about changing where the source of truth lives.

Instead of asking teams to manually stitch together Mixpanel exports, Zendesk tickets, and log data, you start from the one system that already has all of it: your backups.

Duplicati turns that historical layer into something you can actually build on.

Data that was previously locked in backup archives is continuously extracted, structured, and versioned into formats that machine learning systems can use. Product events, support conversations, and system logs are no longer separate streams; they become a single, time-indexed dataset that reflects how the product behaves in the real world.
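To make the idea of a single, time-indexed dataset concrete, here is a minimal sketch of what that merge step could look like. Everything here is hypothetical: the `Event` record, the field names, and the tiny sample streams stand in for whatever your real product-event, ticket, and log exports contain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable

# Hypothetical unified record: one row per event, regardless of source system.
@dataclass
class Event:
    timestamp: datetime
    user_id: str
    source: str   # "product", "support", or "logs"
    payload: dict

def merge_streams(*streams: Iterable[Event]) -> list[Event]:
    """Interleave product events, support tickets, and log lines
    into one timeline, ordered by timestamp."""
    merged = [e for stream in streams for e in stream]
    merged.sort(key=lambda e: e.timestamp)
    return merged

# Tiny sample streams standing in for real backup-derived exports.
product = [Event(datetime(2025, 3, 1, 9, 0), "u1", "product",
                 {"step": 3})]
support = [Event(datetime(2025, 3, 1, 9, 5), "u1", "support",
                 {"ticket": "What does this step do?"})]
logs    = [Event(datetime(2025, 3, 1, 9, 1), "u1", "logs",
                 {"error": "integration timeout"})]

timeline = merge_streams(product, support, logs)
# One user's full journey, in order: product event at step three,
# a silent failure in the logs, then the support ticket it triggered.
```

The point of the example is the ordering: once the three streams share a timeline, the log error that sits between the product event and the support ticket is no longer invisible.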

From there, the workflow starts to look very different.

Instead of a product manager staring at a funnel chart, the team can train a model on actual user journeys, including the friction points captured in support tickets and the failures recorded in logs.

Instead of running isolated A/B tests, they can simulate how changes would affect onboarding by learning from years of prior behavior.

Instead of manually tagging and categorizing support tickets, they can build systems that automatically identify patterns, predict issues, and even suggest product changes based on historical outcomes.

This is not replacing analytics with a better dashboard. It is replacing analytics with a system that learns.

A concrete example: replacing product analytics with a learning system

Go back to the onboarding problem.

In the current world, the team might say: “Step three has a 40% drop-off. Let’s test a new UI.”

In a system built on top of backup-derived training data, the question becomes: “What actually happens to users who reach step three, across every signal we have?”

The model can look at:

  • The exact sequence of product events leading up to the drop-off

  • The support tickets those users created before or after

  • The system logs that show whether something failed silently

  • The historical outcomes of similar users who did complete onboarding

Instead of guessing, the team gets a prediction grounded in its own history: which users are likely to fail, why they fail, and what interventions have worked before.
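As a rough illustration of how those signals could feed a prediction, here is a toy sketch. The journey format, feature names, and weights are all invented for the example; a real system would fit the weights on historical outcomes rather than hard-coding them.

```python
# Hypothetical per-user journey: (source, detail) pairs, already
# time-ordered, as they would come out of a merged backup-derived dataset.
journey = [
    ("product", {"step": 1}),
    ("product", {"step": 2}),
    ("logs",    {"error": "integration timeout"}),
    ("support", {"ticket": "this integration failed but I don't know why"}),
]

def features(journey):
    """Collapse one user's journey into the signals the model scores."""
    steps = [d["step"] for s, d in journey if s == "product"]
    return {
        "max_step": max(steps, default=0),
        "errors":   sum(1 for s, d in journey if s == "logs" and "error" in d),
        "tickets":  sum(1 for s, _ in journey if s == "support"),
    }

def drop_off_risk(f):
    """Toy linear score with made-up weights; a real model would
    learn these from years of prior onboarding outcomes."""
    score = 0.5 + 0.2 * f["errors"] + 0.15 * f["tickets"] - 0.05 * f["max_step"]
    return max(0.0, min(1.0, score))

risk = drop_off_risk(features(journey))
# 0.5 + 0.2 (one error) + 0.15 (one ticket) - 0.10 (reached step 2) = 0.75
```

Even in this toy form, the shape of the answer changes: instead of one aggregate drop-off rate, each user gets a risk grounded in their own sequence of events, failures, and conversations.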

That is a fundamentally different level of insight than anything Mixpanel or Amplitude can provide.

Why this matters economically

SaaS companies already spend heavily on analytics, customer data platforms, and increasingly, AI tools layered on top of them.

But most of that spend is trying to extract more value from systems that were never designed to be training data pipelines.

At the same time, they are sitting on years of proprietary data that is far more relevant than anything they could buy.

Duplicati changes the equation by turning an existing cost center, backup storage, into the foundation for AI systems.

Instead of paying for multiple layers of analytics and external data, companies can build directly on their own operational history, reducing tool sprawl while increasing the quality of their models.

The shift SaaS teams will need to make

For most teams, the hardest part is not technical. It is conceptual.

They are used to thinking of backups as insurance, something you hope you never need. They are used to thinking of analytics as a separate layer, something you query after the fact.

But if you step back, the most complete, least biased, and most valuable dataset your company has is not in your analytics tools.

It is the full record of how your product has been used, how it has failed, and how it has evolved over time.

That record already exists.

The question is whether you continue to store it, or finally start learning from it.

Get started for free

Pick your own backend and store encrypted backups of your files anywhere online or offline. For macOS, Windows and Linux.
