
Mar 26, 2026

Quant Firms Are Sitting on the Most Valuable Dataset They Don’t Use

A typical quant research workflow today is highly optimized, but narrowly scoped. A researcher pulls tick data from kdb+, joins it with alternative datasets sourced through Snowflake or a marketplace, engineers features in Python, and runs experiments tracked in MLflow or Weights & Biases. Results are pushed into internal dashboards and debated with portfolio managers. This system is fast, sophisticated, and expensive—and yet it is incomplete, because it systematically excludes the most valuable dataset the firm already owns: its own history of decisions.
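To make that workflow concrete, here is a minimal Python sketch of the loop described above. The synthetic frames stand in for the kdb+ and Snowflake pulls (which would normally come through pykx or the Snowflake connector); only the pandas and MLflow calls reflect real APIs, and the feature and metric names are illustrative.

```python
import numpy as np
import pandas as pd
import mlflow

# Synthetic stand-ins for the kdb+ tick pull and the Snowflake alt-data join.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-02", periods=252, freq="B")
ticks = pd.DataFrame({"close": 100 + rng.standard_normal(252).cumsum()}, index=idx)
alt = pd.DataFrame({"card_spend": rng.standard_normal(252).cumsum()}, index=idx)

# Feature engineering in pandas: a short-horizon return plus a lagged alt-data signal.
feats = pd.DataFrame({
    "ret_5d": ticks["close"].pct_change(5),
    "spend_chg": alt["card_spend"].diff(5).shift(1),
}).dropna()

# Track the experiment in MLflow, as in the workflow above.
with mlflow.start_run(run_name="semis_baseline"):
    ic = feats["spend_chg"].corr(feats["ret_5d"])  # naive information coefficient
    mlflow.log_param("lookback_days", 5)
    mlflow.log_metric("ic", float(ic))
```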

Every trade ever executed, every model ever trained, every backtest that failed, every research note, every strategy iteration, and every internal discussion that shaped a decision exists somewhere in the organization. That corpus is longitudinal, proprietary, and directly tied to how the firm generates alpha. It is also largely unused. The reason is not philosophical; it is structural.

Where the Most Valuable Data Actually Lives

Quant firms are required to retain years of historical data for compliance—trade logs, execution data, research artifacts, model outputs, and communications. This data is preserved in backup systems built for durability and auditability. These systems are excellent at ensuring nothing is lost and everything can be restored, but they were never designed to support research workflows. They do not provide indexed retrieval, cross-time joins, or reproducible pipelines, which means accessing this data for experimentation typically requires brittle, one-off extraction layers.

In practice, this creates a clean separation. Operational systems like kdb+, Snowflake, and S3 are optimized for querying, speed, and iteration, while backup systems are optimized for immutability and compliance. Because these worlds do not connect, the most complete record of the firm’s behavior remains invisible to the research stack, and that disconnect defines how quant research works today.

Why Firms Buy Data They Already Have

Because backup data is inaccessible, firms look outward for signal. They spend heavily on alternative datasets—credit card panels, satellite imagery, web exhaust, and sentiment feeds—to augment their models. These datasets can be useful, but they are inherently generic; every fund can buy the same feeds, and at scale they become table stakes. More importantly, they lack context. They cannot tell you why a trade was made, how a model evolved, or which signals were systematically ignored.

That context exists only internally, and it is precisely what gives data its edge. Internal data encodes how the firm interprets the market, how strategies adapt across regimes, and how decisions are made under uncertainty. In an era where model performance is increasingly constrained by data quality, this distinction matters: external data approximates behavior, while internal data records it.

What Unlocking Internal Data Actually Looks Like

Consider a concrete example. A researcher wants to build a new model for semiconductor equities. Today, they would pull a decade of market data from kdb+, join it with a set of alternative datasets in Snowflake, engineer features, and run backtests. The workflow is well understood, but it effectively resets context at the start of each project.

Now imagine the same workflow with access to the firm’s historical record. In addition to market data, the researcher can query every past semiconductor strategy the firm has run, inspect the exact features those models used, analyze how performance changed across volatility regimes, read the research notes explaining why certain signals were introduced or removed, and examine execution data that shows how those strategies behaved in production.

Instead of starting from scratch, the researcher starts from accumulated experience. They can ask which features have historically worked for this sector, which signals have repeatedly failed and under what conditions, and how past models behaved during comparable market environments. This is not just more data; it is a different class of data—a record of decision-making. It enables models to be trained on past decisions, not just outcomes; it allows researchers to reconstruct historical environments for reproducible experimentation; and it surfaces patterns across strategies and teams that are otherwise invisible.
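As a hedged sketch of what querying that decision record could look like, assume the history has already been materialized into a table. The strategy_history schema and its rows below are invented for illustration, and DuckDB stands in for whichever query engine the firm prefers.

```python
import duckdb

con = duckdb.connect()

# A hypothetical decision-record table; the layout is illustrative only.
con.execute("""
    CREATE TABLE strategy_history AS SELECT * FROM (VALUES
        ('semis_mom_v1', 'momentum_20d', 'low_vol',   0.8, 'cut after 2022 drawdown'),
        ('semis_mom_v1', 'momentum_20d', 'high_vol', -0.3, 'cut after 2022 drawdown'),
        ('semis_rev_v2', 'card_spend',   'high_vol',  0.5, 'kept; robust across regimes')
    ) AS t(strategy, feature, vol_regime, sharpe, research_note)
""")

# Which features have historically worked for this sector, and in which regimes?
print(con.execute("""
    SELECT feature, vol_regime, avg(sharpe) AS avg_sharpe
    FROM strategy_history
    GROUP BY feature, vol_regime
    ORDER BY avg_sharpe DESC
""").df())
```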

Some firms attempt to approximate this today by building custom pipelines that extract logs, rebuild datasets, and stitch together fragments of history. These efforts can work, but they are slow, fragile, and expensive to maintain because there is no underlying infrastructure designed for this use case.

The Real Problem Is Not Data. It Is Infrastructure.

The limitation is not that quant firms lack data; it is that their infrastructure was never designed to use it this way. Backup systems assume data is static and rarely accessed, while AI systems assume data is structured, queryable, and continuously updated. These assumptions conflict, so the most complete dataset a firm owns remains disconnected from the systems that could learn from it.

As a result, firms train models on partial views of reality while their full historical record sits unused. They pay once to store it for compliance and again to replace it with external data, creating a persistent inefficiency in both cost and signal quality.

Duplicati: Connecting the Two Worlds

Duplicati resolves this disconnect by making backup data usable inside the research stack. Instead of treating backups as static archives, it continuously indexes and structures historical data into formats that AI systems can consume. Trade logs, research artifacts, and model outputs are exported into Parquet or Delta Lake on S3-compatible storage and made queryable alongside existing datasets.
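From the research side, access could look like the following sketch, which uses DuckDB's httpfs extension to scan Parquet on an S3-compatible store. The endpoint, credentials, bucket path, and column names are assumptions for illustration, not a documented Duplicati layout.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Endpoint and credentials for the S3-compatible store are placeholders.
con.execute("SET s3_endpoint='minio.internal:9000';")
con.execute("SET s3_use_ssl=false;")
con.execute("SET s3_access_key_id='...'; SET s3_secret_access_key='...';")

# The bucket path and columns are hypothetical; any Parquet export works.
slippage = con.execute("""
    SELECT symbol, count(*) AS fills, avg(slippage_bps) AS avg_slippage_bps
    FROM read_parquet('s3://backup-exports/trade_logs/*.parquet')
    GROUP BY symbol
    ORDER BY avg_slippage_bps DESC
""").df()
```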

This allows backup data to sit next to kdb+ and Snowflake within the same workflow. Researchers can join it with market and alternative data, feed it into feature engineering pipelines, track versions in MLflow or DVC, and incorporate it into continuous retraining loops. The workflow does not need to change; the dataset simply becomes richer, and with it, the models’ capacity to learn from the firm’s own behavior.
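For instance, a backup-derived execution summary can be joined directly against a frame pulled from the operational stack. The path and columns continue the hypothetical layout from the previous sketch, and the market frame here is synthetic.

```python
import duckdb
import pandas as pd

# A frame as it might arrive from kdb+ or Snowflake (synthetic here).
market = pd.DataFrame({"symbol": ["NVDA", "AMD"], "ret_1d": [0.012, -0.004]})

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.register("market", market)  # expose the pandas frame to SQL

# Enrich live market data with historical execution behavior from backups.
enriched = con.execute("""
    SELECT m.symbol, m.ret_1d, b.avg_slippage_bps
    FROM market m
    LEFT JOIN (
        SELECT symbol, avg(slippage_bps) AS avg_slippage_bps
        FROM read_parquet('s3://backup-exports/trade_logs/*.parquet')
        GROUP BY symbol
    ) b USING (symbol)
""").df()
```

From there, the enriched frame flows into the same feature pipelines and MLflow or DVC tracking as any other dataset.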

From External Signals to Internal Intelligence

When internal data becomes accessible, the nature of model development shifts from sourcing external signals to leveraging internal intelligence. Models can incorporate behavior, decision-making, and strategy evolution as first-class inputs. This is inherently difficult to replicate, because no other firm has access to the same dataset. It is proprietary by definition, and its value compounds over time as more decisions are recorded and learned from.

Turning a Cost Center Into an Edge

Backup infrastructure has historically been treated as a cost center that satisfies compliance requirements and protects against failure, but does not contribute to returns. In a world where AI performance is driven by data quality, that assumption no longer holds. The most important dataset is the one that captures how an organization actually operates, and for quant firms, that dataset already exists.

Duplicati does not add superficial AI features to backup systems. It connects backup to the systems where AI is built, turning a passive archive into active infrastructure. The firms that adopt this model will not just reduce external data spend; they will build models that reflect their own accumulated intelligence—an advantage that cannot be purchased, only unlocked.



Get started for free

Pick your own backend and store encrypted backups of your files anywhere online or offline. For macOS, Windows, and Linux.
