Garbage In, Genius Out? What AI Can Do With Messy Data Today
- The Professor

- Nov 20
- 7 min read
Dek: Many organisations still feel they must perfect their data before they start with AI. Modern AI challenges that assumption, and the middle ground is where real value emerges.

1. Opening: AI data quality
How many times have you heard the phrase “garbage in, garbage out”? It shaped business thinking for decades, rooted in more than 70 years of AI practice built on symbolic and statistical methods that needed structured, tightly controlled data to function (Russell & Norvig, 2021). At a conference this week, a speaker repeated the usual line: AI is impossible until data is fully cleaned and perfectly structured. The room fell silent. You could feel the mood change. No one questioned the claim; it was accepted as fact.
But is it still true? Or has modern AI moved us beyond the idea that data must be pristine before an AI strategy can make progress?
2. Why the traditional view still exists
Takeaway: High data maturity provides control, but it also slows adoption.
From the 1950s through to the early 2000s, organisations were taught that AI systems behaved like traditional software: they needed structured tables, consistent labels and carefully controlled inputs. Early machine learning models were unforgiving. When data was noisy or incomplete, performance degraded sharply (Whang et al., 2023). If fields were missing, the model struggled. If the data were inconsistent, the results were skewed. This created a culture in which teams prioritised databases, taxonomies, and catalogues long before anyone could use the data.
Regulation reinforced this mindset. In banking, BCBS 239 set strict expectations for accurate, timely and traceable risk data (Basel Committee on Banking Supervision, 2013). In health, guidance on audit trails and clinical data quality emphasised governance and reliability (European Medicines Agency, 2023). Across the wider public sector, frameworks highlighted the importance of well-managed, trustworthy data (Government Data Quality Hub, 2020). With the technology of the time, the only way to achieve these expectations was through high data maturity.
Traditionalists are not wrong. For high-stakes functions, this approach is still necessary. But the logic is now applied too broadly, even in low-risk areas where modern AI can add value with imperfect, messy data.
3. The modern view: AI can work with organic data
Takeaway: Today’s models can operate with real-world messy data, but outcomes vary with noise.
AI models such as ChatGPT, Claude, Gemini and DeepSeek R1 are built to work with text, screenshots, logs, PDFs and other unstructured sources, reflecting advances in self-supervised learning (LeCun, 2021). These models learn from patterns across vast volumes of data, making sense of contradictions, filling gaps and drawing meaning from context. They work more like people, able to infer, compare and reason, rather than requiring rigid inputs.
A practical example: we built a student-facing chatbot using a messy web page with dozens of documents, some outdated and some current. The model navigated inconsistencies, produced helpful answers and improved rapidly. When it made mistakes, we pointed it to the correct information. The data never changed, but the model’s performance did.
Modern AI enables organisations to test ideas earlier, move faster and generate value without multi-year data-cleansing projects. Models can spot anomalies the human eye would miss, drawing on decades of work in machine-learning-based anomaly detection (Chandola, Banerjee, & Kumar, 2009).
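To make the anomaly-spotting point concrete, here is a minimal sketch in the spirit of the statistical methods surveyed by Chandola et al. (2009). The invoice-like values and the 3.5 threshold are illustrative assumptions, not a recommendation:

```python
# Minimal sketch: flag suspicious values in messy numeric data using a
# robust z-score (median + MAD), one simple technique from the anomaly
# detection literature. Data and threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
# "Organic" data: mostly plausible amounts, plus a few entry errors.
values = np.concatenate([rng.normal(100, 15, 200), [900.0, -250.0, 1200.0]])

median = np.median(values)
mad = np.median(np.abs(values - median))      # median absolute deviation
robust_z = 0.6745 * (values - median) / mad   # outlier-resistant z-score
anomalies = values[np.abs(robust_z) > 3.5]

print(f"Flagged {anomalies.size} of {values.size} values for review")
```

The point is not the statistics but the workflow: flag suspicious records for human review instead of blocking the project until every record is clean.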
However, flexibility brings trade-offs. Models behave unpredictably when input becomes extremely noisy (Whang et al., 2023). Irregular or biased datasets can push outputs in skewed directions (Shome et al., 2022). Because reasoning is statistical rather than rule-based, explanations are sometimes difficult to provide in regulated environments. This is why messy data still poses risks in high-stakes contexts, even though modern models cope better than earlier systems.
Research continues to show that poorly curated or biased data undermines performance and trustworthiness (Whang et al., 2023; Shome et al., 2022). AI data quality still matters; it is just not the only story.
4. The real middle ground: good enough, with guardrails
Takeaway: You can start with imperfect data, provided the risk is managed.
Most mature organisations now follow a pragmatic middle path:
You do not need flawless data. You need data that is fit for purpose, and governance that matches the risk.
Teams begin with specific use cases where imperfect data is acceptable. As AI surfaces gaps or inconsistencies, issues are fixed through real usage rather than through separate, large-scale initiatives. Each deployment reveals what matters: some datasets are messy but harmless, others need attention, and a small number are critical for compliance or accuracy.
This approach creates a feedback loop. Higher-risk areas receive stronger oversight, while lower-risk areas move quickly. It mirrors guidance such as the UK Government Data Quality Framework, which defines quality as “fitness for purpose” and emphasises continuous improvement (Government Data Quality Hub, 2020).
The outcome is early value, clearer insight into your data estate and steady improvements in data maturity without stalling innovation.
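The feedback loop described above can be sketched in a few lines: log each model error against the dataset it came from, then rank datasets by where errors concentrate so cleaning effort follows real usage. All names and log entries below are hypothetical:

```python
# Minimal sketch of the usage-driven feedback loop: errors observed in real
# use are logged per source dataset, and cleaning priority follows the counts.
# Dataset names and log entries are hypothetical.
from collections import Counter

error_log = [
    {"dataset": "policy_docs",   "issue": "answer cited an outdated document"},
    {"dataset": "policy_docs",   "issue": "two sources contradicted each other"},
    {"dataset": "meeting_notes", "issue": "typo carried into a summary"},
    {"dataset": "policy_docs",   "issue": "required section missing"},
]

priorities = Counter(entry["dataset"] for entry in error_log).most_common()
for dataset, count in priorities:
    print(f"{dataset}: {count} issues observed in real usage")
```

Messy-but-harmless datasets sink to the bottom of the list on their own; the small number that genuinely matter rise to the top.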
5. Different use cases demand different data quality
The table below links AI data quality to business risk through real-world scenarios.
Use case | Data quality needed | Risk level | Example | Why |
Drafting reports, summaries | Low | Low | Summarising internal meeting notes | Humans easily spot issues; no compliance risk. |
Customer service automation | Medium | Medium | AI-supported responses to student queries | Mistakes affect experience but not safety. |
Internal policy assistance | Medium | Medium | Chatbot using mixed-quality policy documents | Requires some consistency but tolerates noise. |
Finance, compliance, safety-critical | High | High | Regulatory reporting, fraud detection | Errors have legal or financial consequences; audit trails are needed. |
Takeaway: Match expectations to the stakes.
A rough data lake is fine for internal analysis, draft content and early AI prototypes. Summarising customer emails, automating responses to internal queries or producing board paper drafts are all suitable low-risk applications for AI with messy data. High-stakes work with regulatory, financial or safety consequences needs better-governed data and stronger AI controls.
6. So who is right?
Both sides are right, but for different reasons. Traditionalists protect the organisation. Modernists focus on speed and value.
A more helpful way to think about AI data quality:
“Fit for purpose in, controlled result out.”
What this means for you
Start with AI now: choose one low-risk use case and launch a pilot within 30 days.
Let AI help refine your data: use model errors to identify genuine data issues rather than guessing.
Increase data maturity as value builds: focus cleaning efforts where usage reveals need, not everywhere at once.
Reserve high-quality data work for high-stakes cases: use a simple risk matrix to classify use cases and match data quality to risk.
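The "simple risk matrix" in the last point can be as small as a lookup: classify each use case by its stakes, then read off the data-quality bar from the table in section 5. A minimal sketch, with illustrative categories and rules:

```python
# Minimal sketch of a risk matrix mapping use-case stakes to a data-quality
# bar, following the table in section 5. Categories and rules are
# illustrative assumptions, not a standard.
from dataclasses import dataclass

QUALITY_BAR = {
    "low":    "rough data lake is fine",
    "medium": "some consistency needed",
    "high":   "governed, audited, high-quality data required",
}

@dataclass
class UseCase:
    name: str
    regulatory_impact: bool   # legal, financial or safety consequences?
    human_review: bool        # can a person check outputs before use?

def risk_level(uc: UseCase) -> str:
    if uc.regulatory_impact:
        return "high"
    return "low" if uc.human_review else "medium"

pilot = UseCase("Summarise internal meeting notes", False, True)
print(f"{pilot.name} -> {risk_level(pilot)} risk -> "
      f"{QUALITY_BAR[risk_level(pilot)]}")
```

Two yes/no questions per use case are usually enough to separate the pilots you can start this month from the work that needs governed data first.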
7. FAQ
How do I know if my data is “good enough”?
Ask three questions. What is the risk if the AI is wrong? Can humans review or override the output? Will the AI help highlight data problems? If risk is low and oversight is possible, the data is likely suitable for a first pilot.
Will using messy data create technical debt?
It can, but waiting creates opportunity cost. Start small in low-risk areas and let real usage show you where data quality genuinely matters. Then you invest cleaning effort where it has the most impact.
What about regulated work?
Regulated environments such as finance, healthcare and safety-critical operations still demand higher-quality data and stronger governance. The "good enough" approach applies to internal tools, content generation, analysis and other low-risk uses.
8. Final thought
Data maturity still matters, but it should not be a barrier to starting. Modern AI can work with noisy, incomplete and inconsistent data. Use AI to improve your data and your understanding of where quality really matters, rather than waiting for an imaginary future where everything is perfect.
9. From the professor’s desk
Across the projects I have advised, the biggest slowdown comes from teams waiting for an idealised future state of perfect data. Colleagues often argue that a major multi-year transformation is required before any AI work can begin. This belief slows progress, even when the use case is low risk and the technology could start adding value immediately. When AI is integrated into real workflows, it reveals the data that genuinely matters and that which does not. That clarity is worth more than any theoretical programme. Start with what you have, and improve as you go.
10. Further reading
The debate about AI and data quality is far from settled. Some authors argue that AI can extract value from imperfect datasets. Others warn that weak data foundations still undermine performance. Together, the literature shows that perfection is not required, but proportionate controls are.
Evidence base
Source | Position | Summary |
Mohammed et al. (2022) | Challenges | Shows noisy or incomplete data weakens machine learning performance across models. |
Ni et al. (2023) | Refines | Suggests perfect data is not always optimal; some models perform better with selectively cleaned data. |
Lukyanenko (2025) | Refines | Redefines data quality as “fitness for purpose,” aligning with a practical middle ground. |
Schwabe et al. (2024) | Refines | Introduces a multi-dimensional framework for assessing data quality by task sensitivity. |
Foidl et al. (2022) | Supports | Explores “data smells,” subtle indicators of deeper data quality issues that AI can surface. |
Slota (2020) | Challenges | Critiques the idea that AI can compensate for poor data quality; warns that hype masks real risks. |
UK Finance (2023) | Challenges | Argues that high-quality structured data is essential, especially in regulated banking. |
Martins et al. (2025) | Challenges | Benchmarks data-cleaning tools; finds that delivering reliable outcomes still requires significant effort.
ACM (2024) | Challenges | A survey showing many machine learning tasks still depend on accurate labels and consistency. |
Industry sources (Domo etc.) | Mixed | Practical pieces; some support early use of messy data, others highlight risks and limitations. |
References (APA)
Basel Committee on Banking Supervision. (2013). Principles for effective risk data aggregation and risk reporting (BCBS 239). Bank for International Settlements.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 1–58.
European Medicines Agency. (2023). Guideline on computerised systems and electronic data in clinical trials.
Government Data Quality Hub. (2020). The Government Data Quality Framework.
LeCun, Y. (2021). Self-supervised learning: The dark matter of intelligence. Meta AI.
Russell, S., & Norvig, P. (2021). Artificial intelligence: A modern approach (4th ed.). Pearson.
Shome, A., Cruz, L., & van Deursen, A. (2022). Data smells in public datasets. In Proceedings of the 1st International Conference on AI Engineering (CAIN 2022).
Whang, S. E., Roh, Y., Song, H., & Lee, J.-G. (2023). Data collection and quality challenges in deep learning: A data-centric AI perspective. The VLDB Journal, 32(4), 789–813.
Mohammed, S. et al. (2022). The effects of data quality on machine learning performance. arXiv.
Ni, W., Miao, X., Zhao, X., Wu, Y., & Yin, J. (2023). Automatic data repair: Are we ready to deploy? arXiv.
Lukyanenko, R. (2025). What is data quality? Defining data quality in the age of AI.
Schwabe, D. et al. (2024). The METRIC framework for assessing data quality. Nature.
Foidl, H., Felderer, M., & Ramler, R. (2022). Data smells: Categories, causes and consequences. arXiv.
Slota, S. C. (2020). Good systems, bad data? Interpretations of AI hype and data quality issues.
UK Finance. (2023). Good AI without good data? Don’t bank on it.
Martins, P., Cardoso, F., Vaz, P., Silva, J., & Abbasi, M. (2025). Performance and scalability of data cleaning and preprocessing tools. Data.
ACM. (2024). A survey of data quality requirements that matter in machine learning.



