We build things. We also think about them.
Alongside client work and product development, the AI Ceylon team runs structured experiments, publishes findings, and maintains honest thinking about applied AI. Not academic. Not marketing. Just what we actually learn.
What we research
Problems we encounter building production AI systems — not theoretical exercises. If it burned us or surprised us, we study it.
How we work
Structured experiments with real tasks and real data. We measure against custom evals, not public benchmarks. We document failure as thoroughly as success.
Why we publish
The applied AI field moves faster when practitioners share honestly. We publish to improve the ecosystem — and to hold ourselves accountable to evidence.
Six domains we study with rigour
Not trends. Problems we encounter repeatedly in production — and believe are worth understanding deeply.
Decision Intelligence
How AI can augment complex business decisions without replacing human judgment. We study frameworks, approval flows, confidence scoring, and institutional memory.
LLM Reliability & Evaluation
Building task-specific eval suites that go beyond public benchmarks. How to measure what actually matters for your production system — and how to detect when it drifts.
Autonomous AI Agents
When to use agents versus simpler pipelines. How to design reliable, observable agents with human override gates that don't become bottlenecks.
Conversational Data Interfaces
Natural language to SQL and analytics — where it works reliably, where it fails, and how to recover gracefully. Practical findings from building Chat2Data.
Document Understanding
Classification, extraction, and routing from unstructured business documents. Edge case handling, confidence thresholds, and when to escalate to human review.
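The threshold-and-escalate pattern above can be sketched in a few lines. The function name, fields, and threshold values here are illustrative, not taken from any AI Ceylon system:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    doc_type: str      # predicted class, e.g. "invoice"
    confidence: float  # model confidence in [0, 1]

def route(extraction: Extraction,
          auto_threshold: float = 0.92,
          review_threshold: float = 0.70) -> str:
    """Route a classified document by confidence.

    High confidence -> process automatically.
    Mid confidence  -> queue for human review.
    Low confidence  -> reject and request better input.
    Thresholds are placeholders; in practice they are tuned
    per document type against a labelled eval set.
    """
    if extraction.confidence >= auto_threshold:
        return "auto_process"
    if extraction.confidence >= review_threshold:
        return "human_review"
    return "reject"

# A mid-confidence extraction escalates to a person.
print(route(Extraction("invoice", 0.81)))  # human_review
```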
Human-AI Collaboration
Design patterns for human oversight that stay effective without slowing teams down. Building trust incrementally — earning automation, not assuming it.
What we're currently testing
Active experiments running alongside product and client work. Some become findings. Some become products. Some just inform how we build.
We run lightweight, time-boxed experiments — typically two to four weeks — against real tasks drawn from production systems or client problem spaces. Results are documented internally and selectively published when they're generalisable enough to be useful outside our context.
What we've written and published
Long-form findings from our research. Practical, direct, with working code and real data where we can share it.
Research that becomes real software
Some experiments produce findings we publish. Others reveal problems worth solving at a product level. This is how research becomes products at AI Ceylon.
Observe
A recurring friction pattern in client work or our own products
Experiment
Run structured tests, measure against real tasks, not benchmarks
Pattern
Document repeatable findings and failure modes across contexts
Ship
Build it into a product or publish what we learned openly
Products that came from this process
Decisio
Live. Emerged from research into how AI can assist complex decisions without replacing the people who own them.
Chat2Data
Beta. Grew from our conversational data interface research — specifically, making natural language to SQL reliable enough for non-technical business users.
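As an illustration of "recover gracefully" in the natural-language-to-SQL setting: validate the generated query before it touches a database, and fall back to asking for clarification rather than failing silently. A minimal hypothetical sketch — the allow-list and function names are assumptions, not Chat2Data internals:

```python
ALLOWED_STARTS = ("select", "with")  # read-only queries only

def vet_sql(sql: str) -> str:
    """Return "run" if the query looks like a safe read,
    otherwise "clarify" so the interface can ask the user
    to rephrase instead of executing something unexpected."""
    normalised = sql.strip().lower()
    if normalised.startswith(ALLOWED_STARTS) and ";" not in normalised.rstrip(";"):
        return "run"
    return "clarify"

print(vet_sql("SELECT count(*) FROM orders"))   # run
print(vet_sql("DROP TABLE orders"))             # clarify
```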
Next product
Currently in the experimentation phase. ETA TBD.
How we think about building with AI
Not a manifesto. Five things we've learned the hard way — principles that shape how every AI system we build actually works.
01
Deploy first, perfect later
A good model in production beats a perfect model in development. We bias toward shipping, observing real behaviour, and improving from evidence — not from intuition.
Every AI Ceylon product launched in an imperfect state and improved in weeks, not quarters.
02
Uncertainty is a feature
We build systems that express confidence levels, not systems that fake certainty. A model that says 'I'm not sure' is more useful than one that guesses with authority.
Decisio shows confidence scores on every AI recommendation. Low confidence surfaces to human review.
03
Benchmarks are marketing
Public leaderboards tell you almost nothing about how a model performs on your specific task, data, and edge cases. We build task-specific evals before we build the system.
Systems tuned against our client evals have outperformed 'state of the art' models on domain-specific tasks by 20–40%.
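A task-specific eval at its simplest is labelled examples from your own domain plus a scoring loop. A minimal sketch — the `model` callable and the cases are placeholders for your real system and data:

```python
def run_eval(model, cases):
    """Score a model against task-specific cases.

    Each case is (input, expected). `model` is any callable
    mapping input -> output; swap in your real system.
    Returns accuracy plus every failure for inspection --
    documenting failures matters as much as the score.
    """
    failures = []
    for prompt, expected in cases:
        got = model(prompt)
        if got != expected:
            failures.append((prompt, expected, got))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

# Toy example with a stub "model":
cases = [("2+2", "4"), ("capital of France", "Paris")]
stub = lambda q: {"2+2": "4"}.get(q, "unknown")
acc, fails = run_eval(stub, cases)
print(acc)  # 0.5 -- and `fails` shows exactly where it broke
```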
04
The human loop is load-bearing
We don't remove human oversight until the system has earned it. Trust is earned incrementally, through proven performance on real data — not assumed from capability demos.
Every agentic workflow we deploy starts with full human approval. Automation gates open over time.
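One way to make "automation gates open over time" concrete: require a streak of human-approved runs before auto-approval unlocks, and reset on any correction. A hypothetical sketch, not the mechanism of any specific deployment:

```python
class ApprovalGate:
    """Auto-approval unlocks only after `required_streak`
    consecutive runs where the human accepted the AI's
    output unchanged. Any correction resets the streak."""

    def __init__(self, required_streak: int = 50):
        self.required_streak = required_streak
        self.streak = 0

    def record(self, human_accepted: bool) -> None:
        self.streak = self.streak + 1 if human_accepted else 0

    @property
    def auto_approve(self) -> bool:
        return self.streak >= self.required_streak

gate = ApprovalGate(required_streak=3)
for accepted in [True, True, False, True, True, True]:
    gate.record(accepted)
print(gate.auto_approve)  # True -- three accepts since the last correction
```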
05
Complexity is a liability
The simplest architecture that reliably solves the real problem wins. We resist the urge to over-engineer, chain models unnecessarily, or add orchestration that adds failure surface.
Three of our five production systems run on a single LLM call with structured output. No agents needed.
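"A single LLM call with structured output" usually means asking the model for JSON matching a schema, then validating before anything downstream trusts it. A sketch using only the standard library — the schema and the simulated response are illustrative:

```python
import json

REQUIRED = {"category": str, "confidence": float}

def parse_structured(raw: str) -> dict:
    """Validate the model's raw text against a minimal schema.
    Raises ValueError on malformed output so the caller can
    retry or escalate instead of propagating garbage."""
    data = json.loads(raw)
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    return data

# Simulated response -- in production this comes from one LLM call.
raw = '{"category": "refund_request", "confidence": 0.87}'
print(parse_structured(raw)["category"])  # refund_request
```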
Future research directions
Four areas we're building knowledge in — not yet at the findings stage, but consistently present in our experimentation and client work.
Multi-agent coordination
Moving from single agents to reliable teams of specialised agents — with observable handoffs, shared memory, and coherent failure modes.
Domain-adapted models
Whether fine-tuning small models beats prompting large ones for specific industries. We're collecting evidence across financial services, healthcare, and logistics.
AI-native UI paradigms
Rethinking interfaces for systems that reason — not just systems that compute. How conversation, confidence, and explanation change product design fundamentally.
Real-time AI observability
Production monitoring that goes beyond uptime to reasoning quality — detecting drift, catching regressions, and alerting on logic failures before users do.
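Detecting drift in reasoning quality can start as simply as comparing a rolling window of per-request quality scores against a frozen baseline. A hypothetical sketch with illustrative numbers:

```python
from collections import deque

class DriftMonitor:
    """Alert when the rolling mean of quality scores falls
    more than `tolerance` below the baseline established
    at deployment time."""

    def __init__(self, baseline: float, window: int = 100,
                 tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record one score; return True if drift is detected."""
        self.scores.append(score)
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.90, window=5)
alerts = [monitor.record(s) for s in [0.91, 0.88, 0.83, 0.81, 0.79]]
print(alerts[-1])  # True -- the rolling mean has slipped below 0.85
```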