
Uncertainty in data: a silent risk and strategies for managing it

At Deep Kernel Labs, after years on complex projects across sectors, we learned a key lesson: uncertainty isn’t an anomaly in data; it’s inherent. Ignoring it leads to error. This article shares our practices for fostering a rigorous, critical, and adaptable data culture to support high-stakes decisions.

José Vega
Data Architect
David Serrano
Data Scientist
October 10th, 2025

    The challenge: data that looks solid but hides uncertainty

    Data often appears as a set of objective facts, ready for analysis and interpretation. In practice, however, it is riddled with uncertainty: noise that manifests as missing values, systematic errors, ambiguous definitions, opaque transformations, or unknown measurement sources.

    Uncertainty isn’t a technical footnote: it silently propagates through the entire workflow, distorting results, eroding trust, and fueling business errors. Beyond its many formal definitions and classifications, all uncertainty is ultimately a form of ignorance, including the ignorance hidden in the probabilistic assumptions of the AI models on which we so often base our decisions.

    As Dennis Lindley taught us, when data is missing, when a model fails to represent the underlying phenomenon, or when we don’t understand how a field was transformed, the failure isn’t in the data — it’s in our knowledge about it.

    At DKL, we’ve experienced this firsthand in contexts where the pressure to deliver quick, visible results often eclipses the need to revisit fundamentals. Some representative cases include:

    • Dashboards hiding sampling bias. In more than one project, aggregated visualizations — seemingly clean and consistent — masked deep biases in how data was collected or selected. For example: time series excluding weekends, sensors active only during specific shifts, “raw” data already transformed, or segmentations that excluded relevant subsets. The result was a distorted view of reality, driving misguided executive decisions.
    • Models trained on inconsistent data with no visible alerts. In environments with periodically updated datasets, we’ve seen small structural changes (such as new unaccounted-for categories, shifting encodings, or redefined fields) go unnoticed due to a lack of systematic validation. Models continued to train without apparent errors but progressively degraded in performance, and in some cases even reversed their behavior in specific segments without anyone noticing in time (a minimal detection sketch follows this list).
    • Automated decisions based on variables whose meaning degraded along the pipeline. This occurs, for instance, when a binary variable is converted to a numeric value to simplify calculations, or a categorical field is replaced with arbitrary codes that lose their original meaning. Without semantic traceability, the business rules that depend on those fields continue to run — but their logic remains hollow. What appeared to be a robust flow was, in reality, making decisions based on data disconnected from its original intent.
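
    The second failure mode above is also the cheapest to catch early. The sketch below is a minimal illustration rather than DKL tooling: it compares a fresh batch against a reference snapshot of expected columns, dtypes, and category sets. The column names, the input file, and the expected_schema / check_structural_drift identifiers are purely illustrative.

        import pandas as pd

        # Reference captured the last time the pipeline was validated (illustrative values).
        expected_schema = {
            "customer_type": {"dtype": "object",
                              "categories": {"residential", "commercial", "industrial"}},
            "consumption_kwh": {"dtype": "float64", "categories": None},
        }

        def check_structural_drift(df: pd.DataFrame, schema: dict) -> list[str]:
            """Compare a fresh batch against the reference and report every deviation."""
            issues = []
            for column, spec in schema.items():
                if column not in df.columns:
                    issues.append(f"missing column: {column}")
                    continue
                if str(df[column].dtype) != spec["dtype"]:
                    issues.append(f"{column}: dtype {df[column].dtype}, expected {spec['dtype']}")
                if spec["categories"] is not None:
                    unseen = set(df[column].dropna().unique()) - spec["categories"]
                    if unseen:
                        issues.append(f"{column}: unexpected categories {sorted(unseen)}")
            return issues

        # Fail loudly instead of letting a model keep training on silently shifted data.
        batch = pd.read_csv("daily_batch.csv")  # hypothetical input
        problems = check_structural_drift(batch, expected_schema)
        if problems:
            raise ValueError("Structural drift detected:\n" + "\n".join(problems))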

    These aren’t exceptional mistakes — they’re systemic symptoms of a discipline that still hasn’t fully integrated uncertainty management into the data lifecycle. In most industries where data-driven decisions affect people and organizations, such issues aren’t anecdotal: they represent genuine operational and strategic risks.


    A real example: invisible uncertainty

    While catastrophic data failures are widely documented, far more common are the everyday “small” issues that seem harmless at first but reveal their scale when viewed collectively.

    One example from our own team illustrates this: in a project estimating electricity demand, a critical variable — customer type — contained silently imputed values in about 7% of records during data loading. The standard EDA didn’t detect this anomaly. For weeks, the model performed acceptably until it suddenly failed to predict consumption in a specific rural area. A later review uncovered that these “silent imputations” had biased predictions for a key customer segment. Unmanaged uncertainty turned into a significant business error. To make matters worse, the problematic functionality was part of a standard data processing library used routinely across multiple projects.
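
    One way to surface this kind of silent imputation is to compare missing-value rates before and after the loading step: if a column loses nulls without a documented, deliberate imputation, something in the pipeline is filling them in. Below is a minimal sketch assuming pandas; the file names and the silent_imputation_report function are illustrative, not the library involved in the case above.

        import pandas as pd

        def silent_imputation_report(raw: pd.DataFrame, loaded: pd.DataFrame) -> pd.DataFrame:
            """List columns whose null rate drops between the raw extract and the loaded table."""
            raw_nulls = raw.isna().mean()
            loaded_nulls = loaded.isna().mean().reindex(raw_nulls.index, fill_value=0.0)
            report = pd.DataFrame({"null_rate_raw": raw_nulls, "null_rate_loaded": loaded_nulls})
            report["silently_filled"] = report["null_rate_raw"] - report["null_rate_loaded"]
            return report[report["silently_filled"] > 0].sort_values("silently_filled", ascending=False)

        raw = pd.read_csv("customers_raw.csv")                # hypothetical extract before loading
        loaded = pd.read_parquet("customers_clean.parquet")   # hypothetical table after loading
        # In the case above, 'customer_type' would have shown a drop of roughly seven points.
        print(silent_imputation_report(raw, loaded))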


    Our commitment: reliability, rigor, and responsibility

    At Deep Kernel Labs, we believe that managing uncertainty is not a technical luxury — it’s an ethical and strategic imperative if we aim to build reliable and responsible solutions. We take the impact of our technological decisions seriously, especially in contexts where errors can have real consequences.

    That’s why we:

    • Develop proprietary tools to assess uncertainty.
    • Integrate traceability and data governance practices into all our workflows.
    • Train our teams to identify, quantify, and communicate these invisible risks.

    Where does uncertainty come from?

    Uncertainty in data doesn’t appear suddenly or in isolation. It emerges, accumulates, and spreads throughout the data lifecycle. It’s often not the result of a single failure, but a constellation of omissions, assumptions, and undocumented decisions that go unnoticed in environments where speed trumps rigor.

    In our experience, uncertainty is rarely a technical problem at its root — it’s epistemological and organizational. We don’t know what we think we know; we lack a shared understanding of what data means; and we have no precise mechanisms to audit that ambiguity. When such mechanisms do exist, they’re fragile, scattered, or relegated to late stages of the process.

    One of the most critical — and underestimated — factors is the absence of formal definitions or initial references against which data validity can be evaluated. Without that baseline, how can we detect anomalies if we don’t know what normal looks like? How can we tell a valid value from a questionable one if no domain of reference was ever made explicit? This semantic and structural void is a significant source of uncertainty — and unless it’s addressed, everything that follows (EDA, modeling, evaluation) rests on unstable ground.

    Moreover, in complex organizations, data flows are rarely linear: multiple teams, tools, and transformation layers create a fragmented reality. In such scenarios, no one sees the entire picture, and accountability for data quality becomes unclear.

    Below are the most recurrent sources of uncertainty we’ve identified in our projects:


    Lack of critical awareness

    EDA has degraded into an informal ritual, dependent on each analyst’s style, without transparent methodology or coverage guarantees. Many significant uncertainties simply go unseen because no one is looking for them.


    Absence of definitions and references

    No clear criteria are set for what constitutes a valid datum, expected domains, or the semantics of each field. Without shared references, there’s nothing to compare reality against.


    Fragmented, untraceable processes

    Responsibility for data quality is dispersed across technical, analytical, and business roles, lacking a unified vision or systematic approach to capture and communicate uncertainty. The focus remains on delivering results, not understanding their limits.


    Inadequate tools

    Current notebooks, scripts, and pipelines often lack explicit functionality to represent, quantify, or trace uncertainty, leaving teams without adequate infrastructure to manage it effectively.

    The outcome is predictable: analyses and models that look solid but rest on unacknowledged layers of ambiguity and error. Ignoring uncertainty doesn’t make it go away — it merely hides it until it surfaces as a failure, bias, or poor decision.


    What now?

    If you work with data and share our concern, we invite you to reflect, share this article, and start a conversation.

    Managing uncertainty isn’t just a good technical practice — it’s an essential part of doing our job well.


    Blind trust as a source of risk

    While disciplines like physics and statistics have explicitly addressed uncertainty for decades, data science often places excessive trust in the immediate source of data — the system that captures it, the expert who defines it, or the tool that extracts it. It’s assumed that after some light cleaning and routine EDA, the data is “good enough.”

    That trust becomes a problem when:

    • No reference framework is defined for verifying data quality and consistency.
    • Internal transformations and their effects are ignored.
    • Hidden assumptions in business rules go unexamined.
    • Cross-validation between sources is omitted (a minimal reconciliation check is sketched after this list).
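
    The last omission is often the cheapest to address. Here is a minimal reconciliation sketch, assuming two pandas DataFrames that are supposed to describe the same customers; the table names, key, and status field are illustrative assumptions, not a prescribed schema.

        import pandas as pd

        def reconcile_sources(billing: pd.DataFrame, crm: pd.DataFrame,
                              key: str = "customer_id") -> dict:
            """Basic cross-source checks: row counts, key overlap, and a figure both systems report."""
            billing_keys, crm_keys = set(billing[key]), set(crm[key])
            return {
                "rows_billing": len(billing),
                "rows_crm": len(crm),
                "keys_only_in_billing": len(billing_keys - crm_keys),
                "keys_only_in_crm": len(crm_keys - billing_keys),
                # Totals that both systems claim to know should agree within an agreed tolerance.
                "active_in_billing": int((billing["status"] == "active").sum()),
                "active_in_crm": int((crm["status"] == "active").sum()),
            }

    Discrepancies found this way are not necessarily errors, but each one is a piece of uncertainty that deserves an explanation before the data feeds a critical decision.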

    Consequently, we have systems that seem reliable but are founded on hidden layers of ambiguity and error. And when a model informs critical decisions — like tuning an electrical grid, issuing a medical diagnosis, or approving a loan — unmanaged uncertainty ceases to be a technical risk and becomes an ethical one.


    Bad news: Generative AI multiplies uncertainty

    Despite its transformative potential, the widespread adoption of generative models has brought a paradoxical and under-discussed consequence: we have industrialized the production of uncertainty.

    For the first time, we are deploying systems at scale that don’t merely process existing data — they create new content through statistical–linguistic inference, without explicit anchoring in verifiable sources or formal reference structures.

    This content, often indistinguishable from genuine information, presents a surface of apparent precision that conceals its fundamentally probabilistic nature.

    In other words, GenAI doesn’t just operate under uncertainty: it produces it by design. And it does so in real time, at high speed, in increasingly sensitive contexts, including customer service, financial decisions, medical diagnostics, knowledge generation, and user assistance.

    What’s most concerning is that in most cases:

    • Generated outputs don’t communicate their confidence level.
    • Plausible inferences are indistinguishable from verified facts.
    • Internal decision mechanisms are opaque and inaccessible to most users.

    This new paradigm doesn’t replace the uncertainty that already existed in data — it multiplies it, disguises it, and distributes it across every layer of the workflow.

    Rather than reducing the problem, generative models have amplified it. Without proper mitigation measures, we risk building increasingly sophisticated systems on ever more diffuse foundations.


    A methodological proposal for addressing uncertainty

    By now, it should be clear that the problem of uncertainty in data doesn’t vanish through goodwill or informal processes. It requires an explicit methodology — one capable of identifying, measuring, tracking, and reducing its impact across the entire data lifecycle.

    Without a systematic approach, teams end up working blindly, relying on inconsistent structures, individual intuition, or inherited habits that prioritize delivery over verification. This is how models are built on unexamined assumptions, and results are reported without acknowledging their limitations.

    The persistent absence of reference frameworks worsens the issue. In too many projects, analysis begins without first defining what’s expected of the data, what values are valid, what formats are correct, or what transformations are permissible. In that semantic void, everything “seems to work” — until it doesn’t. Therefore, any methodology aiming to manage uncertainty must, without exception, begin by establishing explicit references to relevant information.

    At DKL, we’ve learned that tools and ad hoc validations are not enough. What’s needed is a practical, auditable, and adaptable methodology — one that professionalizes EDA, integrates uncertainty as a core analytical variable, and turns the invisible into something measurable and manageable.

    Below are the five phases we consistently apply in our projects. This framework is not intended as a fixed standard, but rather as a living approach with a central goal: to make data as trustworthy as the decisions we want to make with it.

    Phase 1: Definition of references and expectations

    Before analyzing any dataset, establish:

    • Formal definitions for each field.
    • Domains of valid values (closed lists, ranges).
    • Expected structure: types, keys, formats, units.
    • Validation rules and explicit assumptions.

    Without this phase, there’s nothing against which to measure data quality.
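
    A reference like this needs no heavy tooling to get started: a plain, versioned declaration already gives the later phases something to compare against. The sketch below is a minimal illustration; the field names, domains, and assumptions are invented for the example, not a DKL standard.

        # Phase 1 artifact: an explicit, versioned description of what we expect from the data.
        FIELD_REFERENCE = {
            "version": "2025-10-01",
            "fields": {
                "customer_id": {"type": "string", "required": True, "unique": True},
                "customer_type": {"type": "category", "required": True,
                                  "domain": ["residential", "commercial", "industrial"]},
                "consumption_kwh": {"type": "float", "required": True,
                                    "range": [0.0, 50_000.0], "unit": "kWh"},
                "reading_date": {"type": "date", "required": True, "format": "%Y-%m-%d"},
            },
            "assumptions": [
                "One reading per customer per day.",
                "Consumption is metered, not estimated, unless flagged upstream.",
            ],
        }

    Stored as code or YAML under version control, the reference becomes auditable: every change to what “valid” means leaves a trace.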

    Phase 2: Structured diagnosis

    Once references are defined:

    • Validate real data against expectations.
    • Detect deviations in types, domains, and structures.
    • Assess coverage and consistency without assuming correctness.
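
    Continuing the sketch from Phase 1, a structured diagnosis can be as simple as replaying the reference against the actual table and collecting every deviation rather than stopping at the first. The diagnose function below is illustrative, not part of any specific library.

        import pandas as pd

        def diagnose(df: pd.DataFrame, reference: dict) -> list[str]:
            """Validate a DataFrame against the Phase 1 reference and return every deviation found."""
            findings = []
            for name, spec in reference["fields"].items():
                if name not in df.columns:
                    findings.append(f"{name}: column missing")
                    continue
                series = df[name]
                if spec.get("required") and series.isna().any():
                    findings.append(f"{name}: {int(series.isna().sum())} missing values in a required field")
                if spec.get("unique") and series.duplicated().any():
                    findings.append(f"{name}: duplicated keys")
                if "domain" in spec:
                    unexpected = set(series.dropna()) - set(spec["domain"])
                    if unexpected:
                        findings.append(f"{name}: values outside the declared domain: {sorted(unexpected)}")
                if "range" in spec:
                    low, high = spec["range"]
                    outside = int((~series.dropna().astype(float).between(low, high)).sum())
                    if outside:
                        findings.append(f"{name}: {outside} values outside [{low}, {high}]")
            return findings

    An empty list doesn’t prove the data is correct; it only means the data matches what we said we expected, which is exactly the distinction this phase is meant to keep explicit.
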
    Phase 3: Quantification of uncertainty

    Measure, for each variable or dataset:

    • % of missing values.
    • % of incomplete records.
    • Entropy of categorical variables.
    • Gaps and anomalies in time series.
    • Frequency of imputations and implicit transformations.
    • Synthetic uncertainty index per field or dataset.
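
    Most of these measurements are one-liners once the data sits in a DataFrame; the only genuinely arbitrary step is combining them into a single index, so the weighting should be explicit and agreed per project. Below is a minimal sketch covering missing rate, normalized entropy, and an illustrative composite index (the 0.7 / 0.3 weights are an assumption, not a recommendation).

        import numpy as np
        import pandas as pd

        def uncertainty_profile(series: pd.Series) -> dict:
            """Per-field metrics plus a simple, explicitly weighted uncertainty index."""
            missing_rate = float(series.isna().mean())
            if series.dtype == object or isinstance(series.dtype, pd.CategoricalDtype):
                probs = series.value_counts(normalize=True)
                # Normalized Shannon entropy: 0 = a single category, 1 = uniformly spread.
                spread = float(-(probs * np.log2(probs)).sum() / np.log2(len(probs))) if len(probs) > 1 else 0.0
            else:
                spread = 0.0
            index = 0.7 * missing_rate + 0.3 * spread  # illustrative weights
            return {"missing_rate": round(missing_rate, 3),
                    "normalized_entropy": round(spread, 3),
                    "uncertainty_index": round(index, 3)}
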
    Phase 4: Communication and traceability
    • Include uncertainty metrics in reports and dashboards.
    • Tag variables by reliability and completeness.
    • Document transformations and version schemas.
    • Audit the origin and evolution of each key datum.

    Phase 5: Resilient and ethical design
    • Professionalize EDA as a formal process, not a checklist.
    • Design pipelines that track and report uncertainty.
    • Simulate data degradation scenarios.
    • Train teams to work with uncertainty, not against it.
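
    Of these, degradation scenarios are the easiest to rehearse in code: deliberately corrupt a copy of the data and confirm that the Phase 2 diagnosis, and everything downstream, actually complains. A minimal sketch assuming pandas; the degrade function, column names, and corruption rates are illustrative.

        import numpy as np
        import pandas as pd

        def degrade(df: pd.DataFrame, column: str, missing_rate: float = 0.05,
                    rogue_value=None, seed: int = 0) -> pd.DataFrame:
            """Return a corrupted copy: inject missing values and, optionally, an out-of-domain value."""
            rng = np.random.default_rng(seed)
            out = df.copy()
            out[column] = out[column].astype(object)  # sidestep categorical-dtype restrictions in this sketch
            out.loc[rng.random(len(out)) < missing_rate, column] = np.nan
            if rogue_value is not None:
                out.loc[rng.random(len(out)) < 0.01, column] = rogue_value  # 1% of rows go out of domain
            return out

        # Rehearsal: the Phase 2 diagnosis should flag both kinds of damage.
        # degraded = degrade(batch, "customer_type", missing_rate=0.07, rogue_value="unknown_segment")
        # assert diagnose(degraded, FIELD_REFERENCE), "degradation went undetected"

    Exercises like this don’t remove uncertainty; they verify that, when it appears, it is detected, measured, and communicated rather than silently absorbed.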