Seven Skills That Actually Matter When You Build Agents
“Agent” is an overloaded word. In practice, shipping something users trust is less about which model you picked and more about the system around it: how data flows, how tools are defined, how failures surface, and how you know it worked.
Below are seven skill areas I keep coming back to, each grounded in projects I’ve built or shipped around this repo: Fueld (multi-agent nutrition and voice onboarding), Barrister Wasabi (legal voice intake with large tool surfaces), Senior (save-triggered diff analysis via a local Rust daemon), VibeCheck / Doppel (on-device models and dual RAG), Steezn (context-aware recommendations), DeAlgo Coach (voice + evaluation loop), and older full-stack work like Floodfix (oracle-driven correctness).
1. System design
Agents are distributed systems with a charismatic front end. You are designing pipelines, not monologues: speech capture, model calls, tool execution, persistence, billing, and fallbacks all have different latency and failure modes.
What this looked like in my work:
- Fueld: Multiple models and integrations (voice, vision, reporting) had to stay within a cost and latency budget per user—not a lab benchmark.
- Senior: A deliberate split—a TypeScript editor extension, a Rust daemon, and local inference—so feedback stays fast and data stays on the machine.
Skill to grow: Draw the boxes and arrows before you tune prompts. If you cannot name what happens when the third step times out, you do not have an agent; you have a demo.
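To make the timeout question concrete, here is a minimal Python sketch of one pipeline step with an explicit time budget and a fallback. The names (`run_step`, `StepResult`, `slow_model_call`) and the fallback text are hypothetical, not taken from any of the projects above.

```python
import concurrent.futures
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    step: str
    ok: bool
    value: str
    detail: str = ""


def run_step(name: str, fn: Callable[[], str], timeout_s: float, fallback: str) -> StepResult:
    """Run one pipeline step with a time budget; return the fallback on timeout or error."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return StepResult(name, True, future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        # The worker thread is not killed here; we only stop waiting for it.
        return StepResult(name, False, fallback, f"timed out after {timeout_s}s")
    except Exception as exc:
        return StepResult(name, False, fallback, f"failed: {exc!r}")
    finally:
        pool.shutdown(wait=False)


def slow_model_call() -> str:
    time.sleep(0.2)  # stand-in for a slow upstream call (hypothetical)
    return "full answer"


result = run_step(
    "model_call",
    slow_model_call,
    timeout_s=0.05,
    fallback="That took too long. Want to try again?",
)
print(result.ok, result.detail)
```

The point of returning a `StepResult` instead of raising is that the caller always gets something it can show the user, plus a `detail` string it can log.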
2. Tool and contract design
The model is only as safe as the API surface you expose. Tools are contracts: names, arguments, side effects, idempotency, and what “success” means.
Examples:
- Barrister Wasabi: Dozens of client-side tools for voice-driven forms only work if each tool has a tight job—otherwise the model improvises and the UI state diverges.
- MCP-style patterns (as in Fueld’s stack): Standardising how the model reaches external capabilities keeps orchestration replaceable—swap models without rewriting every integration.
Skill to grow: Design tools like a public SDK. Version them. Document error shapes. Prefer many small tools over one “do everything” tool that becomes a second language model.
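As a rough illustration of "tools as contracts", the Python sketch below gives a tool a name, a version, typed arguments, a documented error shape, and an idempotency key. Everything here (`Tool`, `ToolError`, `set_field`, the error codes) is a hypothetical design for illustration, not the API of Barrister Wasabi, Fueld, or MCP.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ToolError:
    code: str        # stable, machine-readable, e.g. "invalid_argument"
    message: str     # human-readable explanation
    retryable: bool  # tells the orchestrator whether a retry can help


@dataclass
class Tool:
    name: str
    version: str
    required_args: Dict[str, type]
    handler: Callable[..., Any]
    _seen: Dict[str, Any] = field(default_factory=dict)

    def call(self, args: Dict[str, Any], idempotency_key: str) -> Any:
        # A duplicate call with the same key returns the first result
        # instead of repeating the side effect.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        for arg, expected in self.required_args.items():
            if arg not in args:
                return ToolError("missing_argument", f"{arg} is required", False)
            if not isinstance(args[arg], expected):
                return ToolError("invalid_argument", f"{arg} must be {expected.__name__}", False)
        unknown = sorted(set(args) - set(self.required_args))
        if unknown:
            return ToolError("unknown_argument", f"unexpected: {', '.join(unknown)}", False)
        result = self.handler(**args)
        self._seen[idempotency_key] = result  # only successes are cached
        return result


form_state: Dict[str, str] = {}


def set_field(field_name: str, value: str) -> dict:
    """One tight job: write a single form field."""
    form_state[field_name] = value
    return {"ok": True, "field": field_name}


set_field_tool = Tool("set_field", "1.0.0", {"field_name": str, "value": str}, set_field)
print(set_field_tool.call({"field_name": "email", "value": "a@example.com"}, "call-1"))
print(set_field_tool.call({"field_name": "email"}, "call-2"))
```

Rejecting unknown arguments is a deliberate choice in this sketch: it surfaces model improvisation as a visible error rather than a silently ignored field.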
3. Retrieval engineering
When the agent needs memory, retrieval is a product feature, not a config toggle. Chunking, embeddings, hybrid search, freshness, and permission boundaries matter as much as the chat layer.
Examples:
- VibeCheck (Doppel): Dual RAG (keyword + semantic) on-device is a bet that different queries need different recall strategies—especially when nothing leaves the device.
- Steezn: “Outfit with weather / activity” is retrieval under constraints—context signals have to join the prompt in a structured way, not as a wall of text.
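One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below is a generic illustration, not a description of how VibeCheck merges its results; the document IDs are made up, and `k = 60` is the constant conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so a
    document ranked well by both retrievers beats one ranked well by one.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, highest first; break ties by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # hypothetical keyword-search results
semantic_hits = ["b", "d", "a"]  # hypothetical embedding-search results
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['b', 'a', 'd', 'c']
```

RRF only needs ranks, not comparable scores, which is convenient when the keyword and embedding retrievers score on different scales.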
Skill to grow: Measure retrieval with task-specific evals (hit rate, wrong-doc rate), not cosine similarity vanity metrics.
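A task-specific retrieval eval can be very small. In this Python sketch, "hit rate" is the share of queries where at least one relevant document appears in the top results, and "wrong-doc rate" is the share where a document that must not appear (wrong user, stale version) shows up. Both definitions, and the names `EvalCase` and `evaluate`, are my assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class EvalCase:
    query: str
    relevant: Set[str]   # any of these in the top-k counts as a hit
    forbidden: Set[str]  # any of these in the top-k counts as a wrong doc


def evaluate(retrieve: Callable[[str], List[str]], cases: List[EvalCase], top_k: int = 3) -> Dict[str, float]:
    if not cases:
        raise ValueError("need at least one eval case")
    hits = 0
    wrong = 0
    for case in cases:
        results = retrieve(case.query)[:top_k]
        if any(doc in case.relevant for doc in results):
            hits += 1
        if any(doc in case.forbidden for doc in results):
            wrong += 1
    return {"hit_rate": hits / len(cases), "wrong_doc_rate": wrong / len(cases)}


# A stub retriever standing in for the real index (hypothetical data).
FAKE_INDEX = {
    "rain jacket": ["doc_jacket", "doc_umbrella"],
    "gym outfit": ["doc_suit", "doc_other_user"],
}


def fake_retrieve(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


cases = [
    EvalCase("rain jacket", {"doc_jacket"}, {"doc_other_user"}),
    EvalCase("gym outfit", {"doc_shorts"}, {"doc_other_user"}),
]
print(evaluate(fake_retrieve, cases))  # {'hit_rate': 0.5, 'wrong_doc_rate': 0.5}
```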
4. Reliability engineering
Agents fail in boring ways: timeouts, partial JSON, race conditions, duplicate tool calls, and human abandonment mid-flow. Reliability is retries, backoff, state machines, and degraded modes.
Examples:
- Steezn: A resilient service layer with caching and retry patterns for AI and search calls—because flaky third parties are the default.
- DeAlgo Coach: Voice paths need browser fallbacks; code execution needs stdout/stderr discipline—users should see deterministic failure, not silent hangs.
- Fueld: Shortening iteration cycles (e.g. on fine-tuning / training pipelines) counts as reliability for the team: faster loops mean you fix behaviour before users memorise the bugs.
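The retry pattern mentioned above can be sketched as follows: exponential backoff with full jitter, retrying only errors marked as transient. `TransientError` and `call_with_retry` are hypothetical names, and the delays are placeholder values rather than settings from Steezn.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Raised by a call that is worth retrying (timeout, 429, 503, ...)."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of budget: fail loudly, not silently
            # The cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the cap so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


attempts = {"n": 0}


def flaky_search() -> str:
    """Stand-in for a third-party call that fails twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("upstream 503")
    return "results"


delays = []
print(call_with_retry(flaky_search, sleep=delays.append), len(delays))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Non-transient errors are deliberately not caught, so a bad request fails on the first attempt.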
Skill to grow: Define SLOs for the agent as a whole: p95 time-to-first-token, tool error rate, and “conversation completed successfully.”
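Those three numbers are cheap to compute once each run is logged as a record. The Python sketch below assumes a hypothetical `RunRecord` shape and uses the nearest-rank definition of a percentile; a real system would read these records from its tracing or analytics store.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    first_token_ms: float
    tool_calls: int
    tool_errors: int
    completed: bool  # did the conversation reach its goal?


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_report(runs: List[RunRecord]) -> Dict[str, float]:
    total_calls = sum(r.tool_calls for r in runs)
    total_errors = sum(r.tool_errors for r in runs)
    return {
        "p95_first_token_ms": percentile([r.first_token_ms for r in runs], 95),
        "tool_error_rate": total_errors / total_calls if total_calls else 0.0,
        "completion_rate": sum(r.completed for r in runs) / len(runs),
    }


# Twenty made-up runs: latencies 100..2000 ms, one tool error, two abandoned.
runs = [
    RunRecord(first_token_ms=100.0 * i, tool_calls=2, tool_errors=1 if i == 7 else 0, completed=i > 2)
    for i in range(1, 21)
]
print(slo_report(runs))
```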
5. Security and safety
Safety is not only “filter bad words.” It is data minimisation, least-privilege tools, auditability, and domain-specific rules (especially regulated or intimate contexts).
Examples:
- VibeCheck: On-device inference—no cloud round-trip—is an architectural safety argument for sensitive behavioural data.
- Barrister Wasabi: UK-specific qualification logic (e.g. limitation-period reasoning) belongs in deterministic code, not in the model’s mood that day.
Skill to grow: Classify data before it touches a model. Separate “creative language” from “legal or medical consequence” and put the latter behind code and tests.
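"Classify data before it touches a model" can start as a plain allow-list. In this Python sketch the field names, the three sensitivity levels, and `prepare_model_input` are all hypothetical; the one design choice worth copying is that unknown fields default to the most restrictive level.

```python
from enum import Enum
from typing import Any, Dict


class Sensitivity(Enum):
    PUBLIC = 1
    PERSONAL = 2
    REGULATED = 3


# Hypothetical policy: every field is classified before any prompt is built.
FIELD_POLICY: Dict[str, Sensitivity] = {
    "display_name": Sensitivity.PUBLIC,
    "goal": Sensitivity.PUBLIC,
    "email": Sensitivity.PERSONAL,
    "medical_notes": Sensitivity.REGULATED,
}


def prepare_model_input(record: Dict[str, Any], max_level: Sensitivity = Sensitivity.PUBLIC) -> Dict[str, Any]:
    """Return only the fields the model is cleared to see."""
    allowed = {}
    for key, value in record.items():
        # Unclassified fields are treated as REGULATED, so a newly added
        # column cannot leak into a prompt by accident.
        level = FIELD_POLICY.get(key, Sensitivity.REGULATED)
        if level.value <= max_level.value:
            allowed[key] = value
    return allowed


record = {
    "display_name": "Sam",
    "goal": "eat more protein",
    "email": "sam@example.com",
    "medical_notes": "...",
    "new_unreviewed_field": "surprise",
}
print(prepare_model_input(record))
```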
6. Evaluation and observability
If you cannot see what the agent did, you cannot improve it. Logging tool calls, tracing multi-step flows, and running offline evals are non-optional at scale.
Examples:
- Senior: Git diff → impact summary is a form of observability for the human: what changed, what it might break, what to check next—before CI runs.
- Fueld / production voice flows: Completion rate, accuracy of captured fields, and unit economics per run are the metrics that tell you if the system is real.
Skill to grow: Build a small “golden set” of conversations or tasks and regression-test them on every prompt or tool change.
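A golden set does not need a framework to be useful. The Python sketch below assumes the agent can be called as a function that returns the tool it chose and the arguments it passed; that interface, `GoldenCase`, and the stub agent are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

AgentFn = Callable[[str], Tuple[str, Dict[str, Any]]]


@dataclass
class GoldenCase:
    name: str
    user_input: str
    expected_tool: str
    expected_args: Dict[str, Any]


def run_golden_set(agent: AgentFn, cases: List[GoldenCase]) -> List[str]:
    """Return a list of human-readable failures; an empty list means no regressions."""
    failures = []
    for case in cases:
        tool, args = agent(case.user_input)
        if tool != case.expected_tool:
            failures.append(f"{case.name}: expected tool {case.expected_tool!r}, got {tool!r}")
        elif args != case.expected_args:
            failures.append(f"{case.name}: expected args {case.expected_args!r}, got {args!r}")
    return failures


def stub_agent(user_input: str) -> Tuple[str, Dict[str, Any]]:
    """Stands in for the real model + prompt + tool definitions."""
    if "weigh" in user_input:
        return "set_field", {"field_name": "weight_kg", "value": "80"}
    return "ask_clarification", {}


GOLDEN = [
    GoldenCase("weight", "I weigh 80 kilos", "set_field", {"field_name": "weight_kg", "value": "80"}),
    GoldenCase("vague", "hmm not sure", "ask_clarification", {}),
]

failures = run_golden_set(stub_agent, GOLDEN)
print(failures or "golden set passed")
```

Run against the real agent, a check like this can gate every prompt or tool change in CI: a non-empty failure list blocks the merge.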
7. Product thinking
The best agent architecture in the world loses to a product that respects user context. Onboarding length, persona choice, channel (voice vs form), and when to ask for confirmation are product decisions.
Examples:
- Fueld: Moving health intake from long forms to a voice-first flow is a bet on friction, not flair: completion and accuracy are the scoreboard.
- Floodfix (older work): Oracle-driven payouts only work if users understand when trust is mathematical vs narrative—same class of problem as “why should I believe this agent?”
Skill to grow: Prototype the failure UX first. What does the user see when the model is wrong—and how fast can they recover without feeling stupid?
Closing
Agents are not a model choice. They are systems made of design, contracts, retrieval, reliability, safety, measurement, and product judgement. If you are serious about the space, invest evenly across all seven: weakness in any one of them shows up in production as “the AI felt flaky,” which is almost never the model’s fault alone.