We build things. We also think about them.
Alongside client work and product development, the AI Ceylon team runs structured experiments, publishes findings, and maintains honest thinking about applied AI. Not academic. Not marketing. Just what we actually learn.
What we research
Problems we encounter building production AI systems — not theoretical exercises. If it burned us or surprised us, we study it.
How we work
Structured experiments with real tasks and real data. We measure against custom evals, not public benchmarks. We document failure as thoroughly as success.
Why we publish
The applied AI field moves faster when practitioners share honestly. We publish to improve the ecosystem — and to hold ourselves accountable to evidence.
Six domains we study with rigour
Not trends. Problems we encounter repeatedly in production — and believe are worth understanding deeply.
Decision Intelligence
How AI can augment complex business decisions without replacing human judgment. We study frameworks, approval flows, confidence scoring, and institutional memory.
LLM Reliability & Evaluation
Building task-specific eval suites that go beyond public benchmarks. How to measure what actually matters for your production system — and how to detect when it drifts.
Autonomous AI Agents
When to use agents versus simpler pipelines. How to design reliable, observable agents with human override gates that don't become bottlenecks.
Conversational Data Interfaces
Natural language to SQL and analytics — where it works reliably, where it fails, and how to recover gracefully. Practical findings from building Chat2Data.
Document Understanding
Classification, extraction, and routing from unstructured business documents. Edge case handling, confidence thresholds, and when to escalate to human review.
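The threshold-and-escalate pattern above can be sketched in a few lines. The function name, fields, and threshold values here are illustrative, not taken from any AI Ceylon system:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    doc_type: str      # predicted class, e.g. "invoice"
    confidence: float  # model confidence in [0, 1]

def route(extraction: Extraction,
          auto_threshold: float = 0.92,
          review_threshold: float = 0.70) -> str:
    """Route a classified document by confidence.

    High confidence -> process automatically.
    Mid confidence  -> queue for human review.
    Low confidence  -> reject and request better input.
    Thresholds are placeholders; in practice they are tuned
    per document type against a labelled eval set.
    """
    if extraction.confidence >= auto_threshold:
        return "auto_process"
    if extraction.confidence >= review_threshold:
        return "human_review"
    return "reject"

# A mid-confidence extraction escalates to a person.
print(route(Extraction("invoice", 0.81)))  # human_review
```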
Human-AI Collaboration
Design patterns for human oversight that stay effective without slowing teams down. Building trust incrementally — earning automation, not assuming it.
What we're currently testing
Active experiments running alongside product and client work. Some become findings. Some become products. Some just inform how we build.
We run lightweight, time-boxed experiments — typically two to four weeks — against real tasks drawn from production systems or client problem spaces. Results are documented internally and selectively published when they're generalisable enough to be useful outside our context.
What we've written and published
Long-form findings from our research. Practical, direct, with working code and real data where we can share it.
Research that becomes real software
Some experiments produce findings we publish. Others reveal problems worth solving at a product level. This is how research becomes products at AI Ceylon.
Observe
A recurring friction pattern in client work or our own products
Experiment
Run structured tests, measure against real tasks, not benchmarks
Pattern
Document repeatable findings and failure modes across contexts
Ship
Build it into a product or publish what we learned openly
Products that came from this process
Decisio
Live. Emerged from research into how AI can assist complex decisions without replacing the people who own them.
Chat2Data
Beta. Grew from our conversational data interface research — specifically, making natural language to SQL reliable enough for non-technical business users.
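As an illustration of "recover gracefully" in the natural-language-to-SQL setting: validate the generated query before it touches a database, and fall back to asking for clarification rather than failing silently. A minimal hypothetical sketch — the allow-list and function names are assumptions, not Chat2Data internals:

```python
ALLOWED_STARTS = ("select", "with")  # read-only queries only

def vet_sql(sql: str) -> str:
    """Return "run" if the query looks like a safe read,
    otherwise "clarify" so the interface can ask the user
    to rephrase instead of executing something unexpected."""
    normalised = sql.strip().lower()
    if normalised.startswith(ALLOWED_STARTS) and ";" not in normalised.rstrip(";"):
        return "run"
    return "clarify"

print(vet_sql("SELECT count(*) FROM orders"))   # run
print(vet_sql("DROP TABLE orders"))             # clarify
```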
Next product
Currently in the experimentation phase. ETA TBD.
How we think about building with AI
Not a manifesto. Five things we've learned the hard way — principles that shape how every AI system we build actually works.
01
Deploy first, perfect later
A good model in production beats a perfect model in development. We bias toward shipping, observing real behaviour, and improving from evidence — not from intuition.
Every AI Ceylon product launched in an imperfect state and improved in weeks, not quarters.
02
Uncertainty is a feature
We build systems that express confidence levels, not systems that fake certainty. A model that says 'I'm not sure' is more useful than one that guesses with authority.
Decisio shows confidence scores on every AI recommendation. Low confidence surfaces to human review.
03
Benchmarks are marketing
Public leaderboards tell you almost nothing about how a model performs on your specific task, data, and edge cases. We build task-specific evals before we build the system.
Systems tuned against our client evals have outperformed 'state of the art' models on domain-specific tasks by 20–40%.
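A task-specific eval at its simplest is labelled examples from your own domain plus a scoring loop. A minimal sketch — the `model` callable and the cases are placeholders for your real system and data:

```python
def run_eval(model, cases):
    """Score a model against task-specific cases.

    Each case is (input, expected). `model` is any callable
    mapping input -> output; swap in your real system.
    Returns accuracy plus every failure for inspection --
    documenting failures matters as much as the score.
    """
    failures = []
    for prompt, expected in cases:
        got = model(prompt)
        if got != expected:
            failures.append((prompt, expected, got))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

# Toy example with a stub "model":
cases = [("2+2", "4"), ("capital of France", "Paris")]
stub = lambda q: {"2+2": "4"}.get(q, "unknown")
acc, fails = run_eval(stub, cases)
print(acc)  # 0.5 -- and `fails` shows exactly where it broke
```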
04
The human loop is load-bearing
We don't remove human oversight until the system has earned it. Trust is earned incrementally, through proven performance on real data — not assumed from capability demos.
Every agentic workflow we deploy starts with full human approval. Automation gates open over time.
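One way to make "automation gates open over time" concrete: require a streak of human-approved runs before auto-approval unlocks, and reset on any correction. A hypothetical sketch, not the mechanism of any specific deployment:

```python
class ApprovalGate:
    """Auto-approval unlocks only after `required_streak`
    consecutive runs where the human accepted the AI's
    output unchanged. Any correction resets the streak."""

    def __init__(self, required_streak: int = 50):
        self.required_streak = required_streak
        self.streak = 0

    def record(self, human_accepted: bool) -> None:
        self.streak = self.streak + 1 if human_accepted else 0

    @property
    def auto_approve(self) -> bool:
        return self.streak >= self.required_streak

gate = ApprovalGate(required_streak=3)
for accepted in [True, True, False, True, True, True]:
    gate.record(accepted)
print(gate.auto_approve)  # True -- three accepts since the last correction
```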
05
Complexity is a liability
The simplest architecture that reliably solves the real problem wins. We resist the urge to over-engineer, chain models unnecessarily, or add orchestration that adds failure surface.
Three of our five production systems run on a single LLM call with structured output. No agents needed.
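"A single LLM call with structured output" usually means asking the model for JSON matching a schema, then validating before anything downstream trusts it. A sketch using only the standard library — the schema and the simulated response are illustrative:

```python
import json

REQUIRED = {"category": str, "confidence": float}

def parse_structured(raw: str) -> dict:
    """Validate the model's raw text against a minimal schema.
    Raises ValueError on malformed output so the caller can
    retry or escalate instead of propagating garbage."""
    data = json.loads(raw)
    for field, ftype in REQUIRED.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"bad or missing field: {field}")
    return data

# Simulated response -- in production this comes from one LLM call.
raw = '{"category": "refund_request", "confidence": 0.87}'
print(parse_structured(raw)["category"])  # refund_request
```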
Future research directions
Four areas we're building knowledge in — not yet at the findings stage, but consistently present in our experimentation and client work.
Multi-agent coordination
Moving from single agents to reliable teams of specialised agents — with observable handoffs, shared memory, and coherent failure modes.
Domain-adapted models
Whether fine-tuning small models beats prompting large ones for specific industries. We're collecting evidence across financial services, healthcare, and logistics.
AI-native UI paradigms
Rethinking interfaces for systems that reason — not just systems that compute. How conversation, confidence, and explanation change product design fundamentally.
Real-time AI observability
Production monitoring that goes beyond uptime to reasoning quality — detecting drift, catching regressions, and alerting on logic failures before users do.
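Detecting drift in reasoning quality can start as simply as comparing a rolling window of per-request quality scores against a frozen baseline. A hypothetical sketch with illustrative numbers:

```python
from collections import deque

class DriftMonitor:
    """Alert when the rolling mean of quality scores falls
    more than `tolerance` below the baseline established
    at deployment time."""

    def __init__(self, baseline: float, window: int = 100,
                 tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record one score; return True if drift is detected."""
        self.scores.append(score)
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.90, window=5)
alerts = [monitor.record(s) for s in [0.91, 0.88, 0.83, 0.81, 0.79]]
print(alerts[-1])  # True -- the rolling mean has slipped below 0.85
```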