AI-Adapted Agile / Scrum
Standard Scrum adapted for teams building AI-integrated products. AI uncertainty, prompt iteration, model evaluation, and data governance are first-class artifacts — not afterthoughts squeezed into a regular sprint.
Roles & Team Composition
Product Owner
Business- Owns AI acceptance criteria — must understand probabilistic outputs
- Defines success metrics (not just user stories, but eval thresholds)
- Prioritises the prompt backlog alongside the feature backlog
- Signs off on eval results before AI features ship to production
AI / ML Engineer
AI Core- Owns prompt engineering, version control, and eval design
- Selects and benchmarks models for each use case
- Designs the RAG pipeline, fine-tuning strategy, or agent architecture
- Investigates unexpected model behaviour in production
Software Engineer
Integration- Integrates Higain.ai SDK into the application layer
- Builds streaming UI, context management, and fallback handling
- Implements feature flags for AI rollout
- Maintains CI/CD pipeline including eval gates
Data Steward
Data- Owns data pipeline, RAG corpus, and labeling quality
- Maintains data governance docs (provenance, consent, PII audit)
- Coordinates human evaluators for eval and feedback review
- Tracks data versioning (DVC or equivalent)
QA / Eval Engineer
Quality- Owns automated eval harness — runs on every PR
- Designs and runs human eval protocols
- Red-teams AI features before each release
- Maintains eval dataset quality and prevents label leakage
Scrum Master
Process- Facilitates AI-specific ceremonies (Prompt Lab, Eval Review)
- Ensures AI uncertainty is made explicit in planning estimates
- Shields team from scope creep driven by model hype
- Tracks AI story velocity separately from standard engineering velocity
Sprint Structure (2-week cycle)
Sprint Planning
3–4 hours- Review product backlog — AI stories and standard feature stories together
- Assign token cost estimates to AI stories (Low < 1K, Med < 10K, High > 10K tokens/request)
- Write measurable eval criteria for every AI story before it enters the sprint
- Identify AI risk cards: hallucination risk, data quality risk, cost overrun risk
- Separate AI workstream tasks: Prompt, Data, Integration, Eval
Standup (15 min)
15 min- Standard: What did I do yesterday? What will I do today? Any blockers?
- AI-specific: Did any prompt or model behaviour change unexpectedly yesterday?
- Flag immediately: unexpected refusals, cost spikes, schema failures, or latency regressions
Prompt Lab Session
1–2 hours- Mid-sprint timebox dedicated to prompt iteration and experimentation
- AI Engineer + QA run variants against the eval set in a shared notebook
- Winning prompt committed to version control before session ends
- Output: updated eval scores and a CHANGELOG entry
Sprint Review
2 hours- Demo includes eval dashboard, not just feature UI — stakeholders see quality metrics
- Present: eval score before vs. after, token cost trend, any regressions and their resolution
- PO signs off using the acceptance criteria rubric defined at sprint planning
- Undone AI stories: assess whether the eval bar was too aggressive or the feature needs more work
AI Retrospective
1.5 hours- Standard retro format + AI-specific lenses:
- → Were our eval criteria realistic and well-defined?
- → Did any model behaviour surprise us? What did we learn?
- → Did we spend too much / too little time on prompt engineering vs. integration?
- → What data quality issues slowed us down? How do we prevent them next sprint?
Story Types
The standard user story format doesn't translate cleanly to AI work. Use these five story types in your backlog — each has a different estimation approach and different definition of done.
Feature Story
Feature“As a [user], I want [functionality] so that [business value].”
Story points (standard Fibonacci)
Code merged, tests passing, eval criteria met, feature flag set.
As a user, I want the AI to summarise my document in 3 bullet points so that I can decide whether to read it.
Prompt Story
Prompt“As an AI Engineer, I need [prompt change] so that [eval metric] improves by [threshold].”
Prompt complexity: Low (quick tweak) / Medium (restructure) / High (new strategy) / Research (unknown)
Prompt version committed and tagged. Eval suite shows improvement. No regression on other metrics.
As an AI Engineer, I need to add output length constraints to the summariser prompt so that ROUGE-L improves from 0.82 to ≥ 0.87.
Data Story
Data“As a [role], I need [data asset] so that [AI capability or eval goal] is achievable.”
Story points based on pipeline complexity, not volume
Data ingested, versioned, PII-audited, quality gate passing. Eval set updated if applicable.
As a Data Steward, I need 50 labelled support ticket examples so that we can establish a baseline eval for the summariser.
Eval Story
Eval“As a QA Engineer, I need [evaluation capability] so that [quality signal] is measurable.”
Story points (typically 2–5 pts)
Eval runs in CI. Results logged to eval database. Dashboard updated.
As a QA Engineer, I need an LLM-as-judge scorer for faithfulness so that RAG outputs are evaluated automatically on every PR.
Spike
Spike“Investigate [question] to inform [decision]. Time-box: [N] days.”
Not pointed — time-boxed. Maximum 1 sprint.
Written findings doc shared with team. Go/no-go recommendation made. Spike branch deleted.
Investigate whether DeepSeek-R1 outperforms Llama 3.1 70B on Nepali-language reasoning tasks. Time-box: 3 days.
Definition of Done for AI Features
An AI feature is not "done" when the code is merged. It is done when these gates pass. Every item is mandatory unless explicitly waived by the PO with written justification.
Code & Integration
- Feature code merged to main and all CI checks passing.
- Unit tests cover application logic with LLM responses mocked.
- Integration test calls real API against golden test cases.
- Feature flag configured — off by default, switchable without deployment.
- Fallback behaviour tested: what happens when the model returns an error or empty response?
Prompt & Model Quality
- System prompt committed to version control with semantic version tag (e.g. v2.3.0).
- Eval suite passes with no metric below the agreed minimum threshold.
- No regression vs. baseline on any dimension by more than 5%.
- Human spot-check: minimum 10 samples reviewed and scored ≥ 3.8 mean.
- Red team checklist completed: prompt injection, jailbreak attempts, edge inputs.
Cost & Performance
- Token cost per request measured and within ±20% of the estimate from sprint planning.
- p95 latency measured and within agreed SLA.
- Load test run at 2× expected peak traffic without degradation.
Observability & Compliance
- Production alert configured for: error rate spike, latency breach, cost overrun, refusal rate anomaly.
- Logging in place: every request logs model, tokens used, latency, and outcome (not the full prompt content unless required).
- PII audit: confirm no personal data is sent to the model or logged in cleartext.
- Data residency confirmed: all inference happening on Nepal-hosted Higain.ai infrastructure.
AI Risk Management
Maintain a dedicated AI Risk Register alongside the standard project risk log. Review at every Sprint Review. The following risk categories apply to all AI features.
| Risk Category | What it means | Early warning signals | Mitigation |
|---|---|---|---|
| Hallucination | Model confidently outputs false information | Eval faithfulness score drops; user complaints about wrong facts | RAG grounding; output citations; human review for high-stakes outputs |
| Bias | Systematically unfair outputs toward a group | Disparate error rates across demographic slices in eval | Bias audit on eval set; diverse annotators; bias-specific red teaming |
| Data Quality | Garbage in, garbage out — bad examples corrupt the model's behaviour | Eval score variance; inconsistent output quality | Data quality gates; inter-annotator agreement checks; regular audits |
| Cost Overrun | Token usage far exceeds budget, making the feature uneconomical | Token cost per request trending up; context window creeping up | Token budgets in requirements; hard limits in code; cost alerts |
| Prompt Injection | Malicious user input overrides system instructions | Unexpected instruction-following; system prompt leakage reports | Input sanitisation; delimiter isolation; output validation; rate limiting |
| Regulatory / Privacy | PII in prompts/logs; data leaving Nepal contrary to policy | Accidental PII in logs; developer error routing to foreign API | PII masking pipeline; data residency audit; all inference via Higain.ai |
Backlog Hygiene for AI Teams
- Maintain separate swim lanes: Features · Prompts · Data · Evals · Ops. Do not mix them in a single flat backlog.
- Prompt stories expire after 2 sprints if not picked up — models, context, and user expectations all shift too fast.
- Spike results must produce a written decision doc. Never close a spike with verbal-only findings.
- Track prompt engineering velocity separately. It is non-linear: 80% of gains come from the first 20% of iterations.
- Data stories carry hidden dependencies — always tag them against the governance checklist from Phase 3.