In AI projects, addressing data gaps is essential for improving data readiness and ensuring reliable data quality. The challenge is not only that “bad data leads to bad models,” but that the very idea of what counts as “bad” is often unclear from the start. The popular saying “garbage in, garbage out” oversimplifies the reality: before training begins, teams must first confront how to define and assess the quality and completeness of their data.
Missing data complicates this further, not just because data points are absent, but because we often don’t even know what is missing or whether it can be meaningfully captured at all. Much like Henry Ford’s observation that, before the era of cars, people could only imagine “faster horses” rather than envisioning the automobile, teams frequently lack the language or perspective to see gaps in their data that point to entirely new solutions.
The problem, then, is twofold: identifying what “garbage” really means in the context of data and learning how to see the invisible – those missing pieces that, if overlooked, can undermine the entire AI initiative before it begins.
What Problem Do Data Gaps in AI Create?
These issues show up differently depending on the kind of project you’re tackling. Broadly, there are two archetypes:
(1) you already have a system and need to make it better with data; or
(2) you’re designing a new system from scratch and must decide what data to collect at all. In the following sections, we explore both situations and the frameworks that make each approach work.
When we already have a system
When a system exists in an AI context, the biggest issue in providing the required data is a tangle of implicit obstacles: data captured differently across teams, shifting field meanings, identifiers that do not align, mixed formats, gaps and missing values, and unclear ownership. In such environments, “bad data” is less a single flaw than a pattern of fragmentation that hides what the system knows.
An ideal approach is to begin with a structured overview of what’s already there and let that picture reveal the weak points. In practice such a report is not simple to assemble, but quantitative readiness frameworks (for example, AIDRIN‑style metrics) can help verify the condition of the data. Together, the overview and the metrics surface where formats or semantics diverge, which cohorts are thin, where labels look suspect, and which variables would be worth improving first.
We will dive deeper into these metrics in a later section, but for now what matters is that, seen this way, improvement is less about grand re‑platforming and more about untangling a supply chain: once the map exists, harmonising identifiers, converging on timestamp conventions, or adding minimal metadata can have outsized effects on reliability and reuse. The goal is not prescriptive checklists but a shared, quantified view of reality that makes the next sensible interventions obvious.
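As a minimal sketch of what such a structured overview can look like, the snippet below profiles two hypothetical tabular extracts and lines the results up side by side; the file names, tables, and the shared customer_id key are assumptions for illustration, not a prescribed layout.

```python
import pandas as pd

def profile_table(name: str, df: pd.DataFrame, key: str) -> dict:
    """Summarise the facts that usually reveal fragmentation: missingness, duplicate keys, types."""
    return {
        "table": name,
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "missing_pct": df.isna().mean().round(3).to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Hypothetical extracts from two teams; the file names and join key are assumptions.
crm = pd.read_csv("crm_export.csv")
billing = pd.read_csv("billing_export.csv")

overview = pd.DataFrame([
    profile_table("crm", crm, key="customer_id"),
    profile_table("billing", billing, key="customer_id"),
]).set_index("table")
print(overview)  # side by side, diverging formats and gaps become visible
```

Even a rough report like this is often enough to decide which identifier or timestamp convention to harmonise first.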
Designing a system from scratch
The core problem – we don’t know what data we need (yet)
At the start of a completely new AI project, “bad data” is not only a matter of poor quality; it is uncertainty about relevance. We often do not know which variables will influence the decision or metric that matters. The early task, therefore, is not to clean everything in sight but to discover what information is valuable.
If we look at the literature, particularly at Value of Information (VoI), we can see a clear way to tackle the question of which variables we really need. In plain terms, VoI asks: if we knew this thing perfectly, would we make a better decision? It pushes us to work backwards from the decision the model will support, list the candidate variables, and rank them by how much they could change that decision. Practically, that becomes a short High/Medium/Low list that we pressure‑test with small trials.
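To make the intuition concrete, here is a toy sketch of the VoI idea under invented assumptions: a stocking decision with two possible order sizes, a simple payoff function, and an assumed demand distribution. It compares the expected payoff of deciding blind with the expected payoff if the variable (demand) were known perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy decision: how many units to stock; payoff depends on an uncertain demand variable.
ACTIONS = (100, 200)
payoff = lambda stock, demand: 10 * min(stock, demand) - 4 * stock
demand_scenarios = rng.normal(150, 40, size=5000)   # assumed demand distribution

# Best single action chosen without knowing demand (decide once, average over scenarios).
best_blind = max(ACTIONS, key=lambda a: np.mean([payoff(a, d) for d in demand_scenarios]))
payoff_blind = np.mean([payoff(best_blind, d) for d in demand_scenarios])

# Best action chosen per scenario, as if demand were known perfectly.
payoff_informed = np.mean([max(payoff(a, d) for a in ACTIONS) for d in demand_scenarios])

voi = payoff_informed - payoff_blind
print(f"Knowing demand perfectly is worth ~{voi:.0f} per decision")
```

Ranking candidate variables by this kind of gap, even roughly estimated, is what turns the High/Medium/Low list into something defensible.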
Another interesting approach is Active Learning and Optimal Experimental Design (OED), where the emphasis shifts to learning by doing rather than guessing up front. Active learning is like asking a teacher only the questions that will teach you the most: from a pool of unlabelled examples, label the ones the model is most uncertain about or that add the most diversity. Optimal Experimental Design is the lab‑knob analogue: when you control measurements or settings, choose them so each run gives you the most signal per unit cost. With a minimal dataset and quick baselines, simple add/remove tests show whether a variable actually moves the needle, and learning curves, overall and by cohort, separate “we need more rows” from “we need different variables”. Together this becomes a rank‑first, probe‑fast loop that turns not knowing into evidence and a prioritised backlog of what to collect next.
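A minimal active‑learning sketch, assuming a synthetic dataset stands in for the real pool and that least‑confidence uncertainty is the query criterion; the batch size and labelling budget are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real unlabelled pool (assumption for the sketch).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

labelled = list(range(20))                                   # tiny labelled seed
unlabelled = [i for i in range(len(X_pool)) if i not in labelled]
model = LogisticRegression(max_iter=1000)

for _ in range(10):                                          # ten small labelling rounds
    model.fit(X_pool[labelled], y_pool[labelled])
    proba = model.predict_proba(X_pool[unlabelled])
    uncertainty = 1 - proba.max(axis=1)                      # least-confident examples score highest
    query = [unlabelled[i] for i in np.argsort(uncertainty)[-20:]]
    labelled += query                                        # in reality: send these to an annotator
    unlabelled = [i for i in unlabelled if i not in query]

model.fit(X_pool[labelled], y_pool[labelled])                # refit on everything labelled so far
print(f"{len(labelled)} labels used, held-out accuracy: {model.score(X_test, y_test):.2f}")
```

The same loop, run with and without a candidate variable, is the add/remove test described above; plotting accuracy against the number of labels gives the learning curve.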
For organisations deploying new AI solutions, this approach is relevant because it short‑circuits endless debate about “bad data”. Instead of arguing in the abstract, teams can agree on what decision they are trying to improve, what information would change it, and what evidence justifies the next collection step. VoI, OED, and active learning tell you what to acquire next; the later readiness and governance steps define when the data is “good enough” and make the remaining gaps visible. In practice, that is how we answer the questions posed in the introduction: we identify what “bad” means in a specific context and learn to see the invisible pieces before they derail the project.
When we know what data we need, but we don’t know how to turn it into structured and usable data
After finding out which variables are necessary to model a given task, we need to gather the dedicated data. There are many frameworks for assessing weaknesses and gaps in data condition; among them, I find two particularly useful in practice, and they frame the work that follows.
AIDRIN: what’s good enough
AIDRIN, mentioned earlier, is a quantitative framework and reporting tool for assessing whether a dataset is ready for AI work. It computes metrics across core dimensions – data quality (missingness, duplicates, outliers), structure (feature redundancy and correlation), label/target properties (such as class imbalance), and responsible‑AI criteria (slice‑level fairness indicators and privacy‑risk signals). It also includes stewardship checks inspired by FAIR, covering basic metadata and provenance. The output is a visual, metric‑based report with drill‑downs that surface weak segments (for example, thin cohorts or leaky features) and make versions comparable over time. The published implementations focus mainly on tabular data, with extensions to other modalities beginning to appear.
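The snippet below is not the AIDRIN tool itself; it is an illustrative pass over a hypothetical tabular dataset that computes a few metrics in the same spirit (missingness, duplicates, class imbalance, redundant features), so the kind of report described above is easier to picture. The file and target column names are assumptions.

```python
import pandas as pd

def readiness_report(df: pd.DataFrame, target: str) -> dict:
    """A handful of AIDRIN-inspired checks on a tabular dataset (illustrative, not the tool)."""
    numeric = df.drop(columns=[target]).select_dtypes("number")
    corr = numeric.corr().abs()
    redundant = [
        (a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.95          # near-duplicate features
    ]
    balance = df[target].value_counts(normalize=True)
    return {
        "missing_pct": df.isna().mean().round(3).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_balance": balance.round(3).to_dict(),
        "imbalance_ratio": round(balance.max() / balance.min(), 2),
        "redundant_feature_pairs": redundant,
    }

# Usage; the file and the target column are assumptions.
df = pd.read_csv("training_data.csv")
print(readiness_report(df, target="churned"))
```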
ODI AI‑Ready Data: make it reusable
ODI’s framework is publisher‑oriented and was designed with open‑data release in mind, but its ideas travel well inside organisations. It specifies the surrounding information a dataset should carry so that downstream users can evaluate and reuse it reliably. The emphasis is on technical optimisation for ML (representativeness, handling of missing values and outliers, valid types and ranges), standards and metadata (schemas/ontologies, rich descriptions), provenance and versioning, and clarity on legal/ethical posture (licence, consent, usage rights, risk). Applied internally, the ODI lens improves discoverability and safe reuse by making ownership, lineage, and context explicit – areas that sit alongside AIDRIN’s numeric checks. Frameworks such as AIDRIN and ODI help organisations close data gaps in AI by improving structure, metadata, and governance.
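As a sketch of what that surrounding context can look like when captured alongside the data, here is an illustrative record loosely inspired by the ODI lens; the field names and example values are my own shorthand, not a schema ODI prescribes.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """Context a dataset should carry so others can evaluate and reuse it safely."""
    name: str
    owner: str                          # who answers questions about this data
    description: str
    schema: dict                        # column -> type and meaning
    provenance: str                     # where it came from and how it was transformed
    version: str
    licence: str                        # usage rights and consent posture
    known_limitations: list = field(default_factory=list)

record = DatasetRecord(
    name="customer_churn_2024",
    owner="analytics-team@example.org",
    description="Monthly churn labels joined to CRM and billing features.",
    schema={"customer_id": "string, pseudonymised", "churned": "bool, label"},
    provenance="Nightly join of CRM and billing exports; transformations documented in the pipeline repo.",
    version="1.3.0",
    licence="Internal use only; consent covers analytics, not resale.",
    known_limitations=["Thin coverage of customers onboarded before 2019."],
)
```

However it is stored, the point is that the record travels with the dataset, so the next team does not have to rediscover ownership, lineage, or licensing from scratch.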
Put simply, “bad data” usually means data that is unfit for modelling or reuse: thin or unrepresentative cohorts, label mistakes, leaky fields, high missingness, duplicates and outliers, inconsistent types and ranges, unclear lineage, and ambiguous licensing. Making it reusable means two things. First, measure and fix the fitness issues with a quantitative pass (for example, using AIDRIN’s metrics to find missingness, imbalance, redundancy, and fairness/privacy flags, and then closing those gaps). Second, supply the context that lets others use the same data safely (for example, following the ODI lens to add clear schemas and metadata, provenance and versioning, plus explicit licensing and consent). While fixing the data requires dedicated tools and techniques, these two approaches give an overview of what even needs fixing and a clear measure of the value of what is already there.
Conclusion
For existing AI systems, the fastest path is to surface what is already there, make the fragmentation visible, and address the highest‑impact weak points. For new builds, begin with the decision, discover which data truly matters with quick tests, and build up deliberately. In both cases, a quantitative check of data fitness comes first, followed by the context that makes the datasets safe to reuse. This sequence replaces abstract debate about “bad data” with a concrete path to data that is model‑ready and dependable across the organisation.
AI projects fail less from a lack of models than from unclear data needs and brittle data foundations. By framing the decision first, discovering high‑value variables through lightweight experiments, and then raising the dataset to a publishable standard with quantitative audits and clear context, teams replace guesswork with evidence. The result is a dataset that is both model‑ready and safely reusable inside the organisation, and a process that reveals invisible gaps before they derail delivery.
Key takeaways
- Enhancing data in existing systems: map what already exists to expose fragmentation (formats, meanings, IDs, ownership) and let that structured overview reveal weak points before any big rebuilds.
- Enhancing data in new builds: when starting from scratch, frame the decision and use small, low‑cost probes to learn which data helps (e.g., rank by value to the decision, test quickly).
- Quantitative view that is useful for both cases: next, quantify dataset fitness with a numeric assessment (for example, AIDRIN‑style metrics) and ensure the surrounding context (metadata, provenance, licensing) makes it reusable (for example, the ODI lens).
- Execution style: converge on a shared, measured view of reality and prefer small, targeted fixes over sweeping re‑platforms.
Both frameworks demonstrate that closing data gaps in AI is not just a technical task but a strategic approach to improving decision quality and long-term scalability. Ultimately, solving data gaps in AI is what turns information into real, measurable value.