Designing Enterprise Agentic Systems

Section 1

Agent Orchestration & Frameworks

Orchestration is the central nervous system of your system. It dictates how tasks are broken down, how tools are selected, and how agents collaborate.

🔄

ReAct Loop

Observe → Reason → Act → Observe. The LLM reasons about the state, calls a tool, then processes the result — cycling until the task is done.

🗺️

Workflows vs Agents

Anthropic recommends predictable workflows for 90% of tasks (routing, chaining, parallelisation) and reserving autonomous agents for truly open-ended problems.

📊

State Machines / DAGs

Production standard. Represent workflows as directed graphs — nodes are agents/functions, edges are conditional routing logic. Enables pause, resume, and inspection.

🏗️ Managed Agents & Infrastructure

Moving from a local script to production requires a lifecycle manager, not just a deployed model.

State Persistence

Save the agent's full graph state to a database (e.g., Postgres) at every node transition so it survives server restarts and can be inspected.

Human-in-the-Loop (HITL)

Pause execution and route high-stakes actions to a human approval queue before resuming — e.g., "Can I execute this DROP TABLE command?"

Enterprise Architecture Overview

Hover each node for details.

👤 User / Client

🔒 API Gateway / Auth

🧠 Orchestrator LLM

⚡ Router
small model

🗄️ State DB
Postgres

🙋 HITL Queue
approval gate

🤖 Sub-Agent A
specialist

🔧 MCP Tools
GitHub/Slack/…

📚 RAG / Vector DB
long-term memory

Interactive: ReAct Loop Simulator

Step through the agent's reasoning cycle.

💬 User Query Received

🧠 LLM: Reason (inner monologue)

🔧 Emit Tool Call (JSON)

⚙️ Tool Executes (API / DB / code)

👁️ Observe Result → Loop or Done?

✅ Final Response to User

—

Section 2

Memory & Context Management

Context is your most precious resource. How you fill it, trim it, and cache it determines your agent's cost, speed, and quality.

⚡

Short-Term Memory

The active context window — conversation history, recent tool outputs. Limited by token budget.

Optimise: sliding window truncation or LLM-based summarisation

💾

Long-Term Memory

Episodic and semantic memory stored in vector databases, retrieved on demand via RAG.

Optimise: Parent-Child chunking, HyDE query rewriting

⚙️

Prompt Cache

Anthropic and OpenAI both support KV-cache reuse for static prefixes — system prompts, rules, large documents.

Saves up to 90% on input token costs & cuts TTFT

⚡ Prompt Caching — Two Distinct Scenarios

The fundamental requirement for a cache hit is that a significant block of text at the very beginning of the prompt is exactly identical to a previous request. The value and implementation differ depending on who is sharing the prefix.

Scenario 1 · Multi-Turn (Same Session)

The most common agentic use case. As the conversation grows, the prompt becomes [System] + [Turn 1] + [AI 1] + [Turn 2]…. In Turn 2 the model recognises that the first three components were already computed in Turn 1.

→Benefit: Only the new tokens are computed — not the growing history.

→Latency: Drastically reduces Time-To-First-Token (TTFT) as history grows.

→Why it matters for ReAct loops: Agents re-send the same system prompt and history on every iteration. Without caching, cost explodes with each hop.

Scenario 2 · Cross-User (Shared Global Cache)

Large-scale applications where thousands of users share the same "personality" or "knowledge base". If 10,000 users all start a "Dynamics 365 Finance Assistant" chat, they all share the same 5,000-token system message.

→Benefit: The provider caches those 5,000 tokens globally — the model only reads them once for every user.

→Savings: Up to 90% on input token costs; Azure OpenAI & Anthropic both support discounted "Cached Token" pricing.

⚠️ Common Trap: Personalisation Breaks the Cache

For Scenario 2 to work, the shared prefix must be exactly identical across users. If you personalise the system prompt — even with something as small as You are helping [User Name] — you change the very first tokens, busting the cache for every user.

❌ You are helping Alice Chen with Dynamics 365… — unique per user, no cache hit

vs

✅ You are a Dynamics 365 Finance Assistant… + user context appended after the shared prefix

💡 Token-Aware Routing

Use cheap, fast models (Claude Haiku, GPT-4o-mini) for intent classification and routing. Reserve expensive frontier models (Claude Sonnet, GPT-4o) only for complex reasoning steps that truly require it. This alone can cut inference costs by 60–80%.

Context Size vs Latency & Accuracy

Larger context windows improve recall but cost exponentially more time and money. RAG + prompt caching targets the sweet spot.

Token Economics: Three Approaches Compared

❌ Naive Full Context

Input tokens: ~80,000

Est. cost: ~$0.24 / call

TTFT: 8–14 s

Accuracy: Suffers from "lost-in-the-middle"

Not recommended at scale

⚡ RAG Only

Input tokens: ~6,000

Est. cost: ~$0.018 / call

TTFT: 2–4 s

Accuracy: High on specific queries

Good — but cold start latency

✅ RAG + Prompt Cache

Input tokens: ~6,000 (+cached prefix)

Est. cost: ~$0.004 / call

TTFT: 0.8–1.5 s

Accuracy: High + consistent system context

Production standard

Section 3

Reliability & Latency

Agents failing is expected and recoverable. Agents acting destructively is not. Both require deliberate architectural design.

⚡ Latency Reduction Tactics

1

Streaming (SSE)

Stream token-by-token. Users perceive Time-To-First-Token (TTFT), not total generation time. A fast TTFT with slow tail feels faster than batch delivery.

2

Parallel Tool Calling

LLM emits a JSON array of tool calls in one response. All tools execute concurrently. Only use when tools are independent — never if Tool B depends on Tool A's result.

3

Prompt Caching

Cache large static prefixes (system prompts, policy docs). Cached tokens are 10× cheaper and skip the encode step, dramatically reducing TTFT on repeated calls.

🛡️ Trustworthy Agent Design

Blast Radius

Define the maximum damage a misbehaving agent can cause. Scope write access via IAM roles — not just prompt instructions. A prompt that says "never delete" is not a security boundary.

Constitutional AI

Encode safety constraints in the system prompt. Useful — but insufficient alone. Prompt-level constraints are bypassable through prompt injection.

API / Tool Level Enforcement

The tool implementation itself must validate and refuse dangerous operations. The LLM calls the tool; the tool decides whether to execute. This is the real safety layer.

🧪 Evaluation Harness Engineering

Critical for Production

You cannot manage what you cannot measure. The hardest part of an agentic system is not building the agent — it is building the evaluation harness that tells you whether it works.

Sandboxed Harness

To test a coding agent you must build a Dockerised harness that:

Clones a fresh repository
Injects a GitHub issue as the task
Grants the agent its full tool set (Bash, Edit, Write)
Automatically runs the test suite when the agent terminates
Records pass / fail — no human judgment involved

The sandbox must mirror production exactly — same OS, same dependencies, same secrets — or your eval scores will diverge from real-world performance.

Deterministic Evals on a Non-Deterministic System

LLMs are stochastic. A single run tells you nothing. You need statistical confidence:

pass@1

Probability the agent succeeds on a single attempt. This is your primary production metric — what the user experiences.

pass@k

Probability of at least one success in k attempts. Useful for debugging — a low pass@1 but high pass@5 indicates the agent can solve the problem but struggles with consistency.

Integrate the Harness into CI/CD

Agent PR merged

→

Docker harness spins up

→

Run against golden dataset (N × each task)

→

Compute pass@1

→

Block if score drops > threshold

Track score trends over time to catch regressions before they reach production. A golden dataset of 50–200 representative tasks is usually sufficient to detect meaningful quality changes.

Interactive: Failover Cascade Simulator

See how a production agent degrades gracefully when its primary model fails.

1

Primary: Claude Sonnet (Anthropic)

Complex reasoning, 200k context, highest quality output

Standby

↓ timeout / rate-limit →

2

Fallback: GPT-4o (OpenAI)

Cross-provider redundancy eliminates single-vendor outage risk

Waiting

↓ timeout / rate-limit →

3

Degraded: Llama-3 8B (local / Ollama)

No external dependency — always available. Limited capability but keeps service up.

Waiting

Ready. Click to simulate an API timeout on the primary model.

Section 4

MCP & Advanced RAG

How your agent connects to the world determines its capabilities and its attack surface. Standardise your integrations and design your retrieval pipeline carefully.

🔗 Model Context Protocol (MCP)

Anthropic introduced MCP as an open standard to solve the fragmented tool-integration problem — every tool required a bespoke wrapper.

Before MCP

🔧 Custom GitHub wrapper → fragile, hard to maintain

🔧 Custom Slack wrapper → duplicated auth logic

🔧 Custom DB wrapper → different interface each time

N tools = N custom integrations = N security audit surfaces

With MCP

🟢 GitHub MCP Server — standardised protocol

🟢 Slack MCP Server — same client interface

🟢 Database MCP Server — unified auth model

1 client protocol × N servers = decoupled, auditable, swappable

📚 Advanced RAG Design Patterns

Pattern 1

Query Rewriting / HyDE

A vague user query like "how do agents handle memory?" embeds poorly. HyDE (Hypothetical Document Embeddings) first asks the LLM to write an ideal answer paragraph, then embeds that to search the vector DB. The hypothetical document is much closer in embedding space to the real answer.

User query → LLM writes hypothetical answer → embed → vector search → retrieve real docs → LLM answers

Pattern 2

Parent-Child Retrieval

Embed small, precise chunks (128 tokens) for high-accuracy similarity matching. When a chunk is retrieved, return its full parent document (1024+ tokens) to the LLM. Solves the tradeoff between embedding precision and answer completeness.

Small chunks: precise matching, no context leakage

Large parent: LLM gets full surrounding context

Pattern 3

Graph RAG

Build a knowledge graph over your document corpus — entities as nodes, relationships as edges. Enables multi-hop reasoning: "How does X relate to Y?" Standard vector search returns similar documents; Graph RAG traverses connected concepts. Best for complex domains where relationships between entities matter.

Trade-off: Significantly higher ingestion cost and infrastructure complexity. Justifiable only when relational queries are common.

Section 5

Case Studies & Design Framework

Real-world architectures and a structured 6-pillar framework for designing any agentic system.

Core Philosophy

An AI coding assistant is not a chat window — it is a production-grade Agent Runtime. Every engineering decision (context compression, speculative execution, state management) optimises for reliability, cost, and latency at the same time, rather than trading one off against another.

Primary Risk Guarded Against

Silent failure at scale. A 7-layer recovery cascade (API backoff → overload handling → token recovery → context compression → context purging → persistent retry → emergency compaction) ensures the agent self-heals from network jitter, API overload, and context overflow rather than crashing silently.

The ReAct Loop — 5 Stages (`while true`)

1 · Context Prep
prune + compress history

→

2 · Streaming Invocation
SSE model call

→

3 · Tool Execution
streaming + batch executors

→

4 · Artifact Collection
tasks · memory · diffs

→

5 · Continue / Terminate
or recover via 7-layer cascade

6 Engineering Highlights

1 · Prompt Cache Segmentation

System prompt is split at a system_prompt_dynamic_boundary marker. The static half (role, tool rules, coding philosophy) is flagged for global cache sharing across all users. The dynamic half (memory, MCP instructions, environment) is never cached. Result: maximum cache-hit rate and up to 90 % reduction in input-token cost.

2 · Four-Tier Context Compression

Snip — lightweight trim before each API call.
Micro-compact — cache-aware, time-based, or API-level compression.
Auto-compact — AI summarisation when token threshold is hit.
reactorcompact — emergency compaction on a 413 overflow error, followed by intelligent restoration of recently accessed files.

3 · Speculative Execution

Tools begin executing in a copy-on-write overlay filesystem before the user confirms. If confirmed, the overlay is merged to disk; if rejected, the overlay is discarded — the real filesystem is untouched. Suggestions are pipelined: the next action starts speculatively while the user reviews the current one, mirroring CPU instruction pipelines to mask confirmation latency.

4 · 20-Check Command Security

Every shell command passes 20 security checks before execution: JQ-injection detection, newline injection, command substitution patterns, IFS injection, Unicode whitespace masquerading, token-theft attempts, and more. In autonomous mode an interpreter blacklist blocks Python / Node / Ruby / Perl / PHP from running without explicit user confirmation.

5 · Zustand-Style State Store

A custom lightweight state store — inspired by Zustand but built for terminal React Ink rendering — holds 100+ global properties (settings, task queues, tool configs, permissions, MCP status, speculative execution state). Object-identity comparison and selector subscriptions ensure re-renders fire only when the subscribed field actually changes, preventing cascading repaints in the terminal UI.

6 · Worker System (6 types · 24 events)

Command — shell execution | Prompt — LLM review | Agent — full multi-turn session | HTTP — external endpoints | Callback — internal TS functions | Function — boolean checks.
24 event types span pre/post tool execution, API requests, conversation lifecycle, compression triggers, and user input — letting enterprise teams customise behaviour (e.g. auto-log every Bash call, security-review before writes) without touching core source code.

Multi-Agent Architecture

Fork Agent

Child inherits parent's full context, runs in an independent process branch.

In-Process Agent

Same process, AsyncLocalStorage for context isolation — lower overhead.

Split-Pane Agent

Leader + Teammate rendered side-by-side in a Tmux split — visible parallelism.

Coordinator Mode — a central orchestrator decomposes tasks into sub-tasks, assigns each to a worker agent with its own prompt, tool set, and model. Four built-in phases: Research → Synthesis → Implementation → Verification. Built-in roles: Planning Agent, Exploration Agent, Verification Agent, and Agentic Coding Guide Agent. Custom agents can be defined via config.

Design Walkthrough

"Design a multi-agent system to automate month-end bank reconciliation. Input: unstructured bank statements + structured GL data. Output: a reconciled ledger and a full audit trail that explains every automated decision."

This is a classic High-Precision agentic workflow. In a financial context, LLMs handle semantic reasoning; deterministic tools handle all computation and writes.

Core Philosophy

Rules clear the easy 80%; agents handle the noisy 20%. The system transitions through three roles: Pattern Matcher → Context Hunter → Bookkeeper. LLMs reason, but every number is produced by a deterministic tool — never by the model itself.

Primary Risk Guarded Against

Hallucinated math and unauditable decisions. A Calculation_Tool owns all arithmetic. Every tool call is written to a structured JSON trace that feeds a human-readable PDF audit report — no black-box reasoning makes it into the GL.

Agent Definitions

A · Verification Agent
The Auditor

Entry point. Joins bank statement to GL, clears exact matches, flags the 20% of noisy discrepancies.

Skills

Fuzzy matching (near-value transactions)
Entity resolution ("MSFT *REDMOND" → "Microsoft Corp")

Tools

SQL_Query_GL
Vector_Search_Vendors (RAG)
Discrepancy_Logger

B · Researcher Agent
The Investigator

Most "agentic" part. Triggered per investigation ticket — infers cause from emails, PDFs, and bank memos.

Skills

Contextual inference (e.g. SWIFT fees)
Unstructured data synthesis
Grouped payment detection

Tools

Email_RAG_Tool
Document_Parser (OCR)
Bank_API_Interface

C · Resolution Agent
The Bookkeeper

Highly constrained. Prepares — never auto-posts — journal entries; generates the audit trail.

Skills

Double-entry logic (debits must balance credits)
Compliance mapping (reason codes)

Tools

D365_Journal_Draft_Creator
Audit_Trail_Generator

Agentic Loop Flow

1 · Intake
Bank ⋈ GL join
clear exact matches

→

2 · Handoff
create investigation
ticket + delta hash

→

3 · Inference
Researcher reasons
over ticket

→

4 · HITL Gate
< 90% confidence
→ flag human

→

5 · Resolution
stage D365 draft
+ generate audit PDF

Technical Guardrails

Deterministic Math

Never let the LLM do arithmetic. A Calculation_Tool accepts (val1, val2) and returns val1 - val2. The model passes the operands; the tool owns the result. Prevents hallucinated subtraction that would corrupt the GL.

Human-in-the-Loop Trigger

A confidence score gates the Resolution Agent. If the Researcher finds multiple plausible explanations (e.g. three emails that could each explain a fee), confidence drops below 90% and the transaction is routed to a human reviewer rather than auto-resolved.

Auditable Reasoning Trace

Every tool call is logged to a structured JSON chain:
[VerificationAgent: $5 delta] → [ResearcherAgent: Email_Tool('Inv-505') → "Service Fee"] → [ResolutionAgent: mapped GL 60500]
This trace feeds the Audit_Trail_Generator PDF, required for SOX compliance.

IAM — Least Privilege

Each agent runs under a separate Service Principal. Verification Agent: read-only GL. Researcher Agent: read-only email + document storage. Resolution Agent: the only principal with Write access to the ERP. No agent can escalate its own permissions.

Follow-up: How would you reduce latency?

1

Fan-Out / Map-Reduce

Spawn N Researcher Agent instances in parallel — one per investigation ticket — instead of processing sequentially. A central Aggregator Agent deduplicates findings so two researchers don't claim the same "found money."

2

Async Parallel Tool Execution

When a Researcher needs both Bank_API and Email_RAG, fire both simultaneously with asyncio.gather() or a task queue (Celery / Temporal). The agent pauses its state, waits for all results, then resumes — cutting I/O wait in half.

3

Speculative Execution

While the Researcher investigates, the Resolution Agent pre-stages the two most likely journal entry drafts ("FX Loss" and "Bank Fee"). Once the Researcher returns a verdict, the correct draft is committed immediately — the resolution step is already done.

4

Message Broker at Scale

For 10,000+ month-end transactions: a Kafka/RabbitMQ queue holds investigation tasks; a worker pool of Researcher Agents pulls from it. A NoSQL result store (CosmosDB) checkpoints intermediate reasoning so a crashed agent can be resumed — not restarted — by another worker.

5

Reasoning Cache (Semantic KV)

If a $1.50 delta for "Vendor X" has already been resolved, cache the reasoning result. The next identical discrepancy skips the Researcher entirely and goes straight to Resolution — one LLM call instead of three.

Agentic System Components

The six core pillars every production agentic system must address.

1. Scope & Blast Radius

Start by asking: is this agent read-only or does it have write capabilities? If it writes — to a database, filesystem, or external API — you must immediately define the blast radius: what is the worst-case action it could take, and how do you contain it?

Apply the principle of least privilege via IAM roles — never give the agent broader access than a single task requires. For irreversible or high-impact actions (e.g. sending an email, deleting a record, executing a trade), insert a Human-in-the-Loop (HITL) approval gate that pauses execution and routes to a human before proceeding. Design actions to be reversible wherever possible — prefer soft deletes, staged commits, and dry-run modes.

2. State Machine / DAG

Model your agent as a directed acyclic graph (DAG): Entry Point → Router → Specialist Agents → Output. The Router classifies the user's intent and dispatches to the appropriate specialist (e.g. a retrieval agent, a code agent, a summarisation agent). Edges represent conditional routing logic — an agent's output determines which node runs next.

This is a Managed Agent architecture: at every node transition, serialise the full graph state to a persistent store (Postgres, Redis). This makes the agent interruptible and inspectable — if the server restarts mid-task, it can resume from the last saved checkpoint rather than starting over. It also enables HITL pauses: the agent suspends at a node, waits for human approval, then resumes exactly where it left off.

3. Data Ingestion (RAG / MCP)

Define how the agent securely connects to data. Use the Model Context Protocol (MCP) as a standardised client-server interface — instead of hardcoding API wrappers, run MCP Servers (GitHub MCP, Slack MCP, database MCP) that the agent can call uniformly. This decouples agent logic from data sources and improves security by keeping credentials server-side.

For unstructured knowledge retrieval, choose your RAG pattern based on the query type: HyDE rewrites vague queries into hypothetical ideal documents before hitting the vector DB; Parent-Child embeds small precise chunks for high-accuracy retrieval but returns the larger parent document for full context; Graph RAG builds a knowledge graph to support multi-hop reasoning across connected entities. Define your chunk size (typically 256–512 tokens), embedding model, and similarity threshold explicitly.

4. The Agent Loop

Describe the specific Reason-Act cycle your agent runs. The ReAct pattern: Observe current state → Reason inside a <thinking> block (inner monologue, not shown to user) → Act by emitting a tool call → Observe the tool result → repeat. The <thinking> phase is critical — it forces the model to plan before acting, dramatically reducing impulsive or incorrect tool calls.

Define what counts as one "step" (typically one tool call + observation), and set a hard cap on steps before forcing human review — a common default is 10–15 steps. Beyond that threshold, the agent should surface its current progress to a human rather than continuing autonomously, preventing runaway loops. Also define your termination conditions: what constitutes task completion vs. task failure?

5. Evaluation Harness

You cannot manage what you cannot measure. Before shipping any agent update, run it against a golden dataset of representative tasks through a sandboxed evaluation harness. For a coding agent, this means a Dockerised environment that clones a repo, gives the agent a GitHub issue, lets it run commands, and then automatically executes the test suite to score the result — no human judgment required.

Because LLMs are non-deterministic, run each task multiple times and compute pass@1 (probability of success on a single attempt) and pass@k (at least one success in k attempts). Integrate this harness into CI/CD — a pull request that degrades pass@1 by more than a threshold should be blocked automatically. Track score trends over time to catch regressions before they reach production.

6. Bottlenecks (Latency / Cost / Reliability)

Latency: Always stream responses via SSE so the user sees the first token immediately rather than waiting for the full response. Use parallel tool calling — instead of sequential tool execution, emit a JSON array of tool calls so multiple tools run concurrently, cutting I/O wait time in half.

Cost: Implement prompt caching on large, stable inputs — system prompts, rule sets, retrieved documents — reducing input token costs by up to 90%. Use token-aware model routing: cheap small models (Haiku, GPT-4o-mini) for simple classification and routing decisions; expensive large models (Sonnet, GPT-4o) only for complex multi-step reasoning.

Reliability: Implement a semantic routing failover cascade — if the primary model times out, hits a rate limit, or triggers a safety filter incorrectly, automatically fall back to a secondary provider within the same request. Log every tool call, input, and output for post-hoc debugging. Set circuit breakers on external tool calls so a single flaky API can't hang the entire agent loop.

Section 6

Agentic System Evaluation

You cannot manage what you cannot measure. Evaluation science for agentic systems requires frameworks purpose-built for non-determinism, multi-agent coordination, and production observability.

Why Evaluation is Different for Agents

Traditional software operates within deterministic bounds. Agents introduce non-determinism — the same prompt can yield different tool selections, reasoning chains, and outcomes across runs. Agent success rates on complex tasks can drop from 60% to 25% when tested for consistency, a failure mode invisible to single-turn testing.

Traditional Observability	Agentic Observability
Focuses on infrastructure (CPU, Memory, Latency)	Focuses on reasoning loops, tool calls, and trajectories
Deterministic paths with reproducible execution	Non-deterministic paths with stochastic deviation
Failure signaled by error codes and timeouts	Failure signaled by degraded quality or hallucination
Metrics: Uptime, Throughput, Error Rate	Metrics: Task Adherence, Tool Selection Quality, Autonomy Index

Multi-Dimensional Evaluation Frameworks

CLASSic Framework

Five core dimensions for enterprise agentic evaluation:

Cost — token spend and infrastructure cost per task
Latency — time-to-completion and TTFT
Accuracy — task success and reasoning fidelity
Stability — consistency across N runs
Security — blast radius containment and policy compliance

Four-Pillar Breakdown

Partition evaluation to isolate failure origin:

LLMs — foundation model reasoning quality
Memory — retrieval accuracy and context management
Tools — selection quality and output utilization
Environment — API reliability and external system behavior

Domain-specific agents achieve ~82.7% accuracy vs. 59–63% for general LLMs.

Key Evaluation Metrics

pass@1 / pass@k

pass@1 — probability of success on a single attempt. pass@k — probability of at least one success in k attempts. Run each task multiple times; LLMs are non-deterministic. A single manual check is statistically meaningless.

Autonomy Index (AIₓ)

Proportion of task steps executed without human intervention:

AIₓ = 1 − (Human Interventions / Total Steps)

Primary ROI signal for agentic deployments.

Process Metrics

Tool Selection Quality — did the agent pick the right tool with correct params?
Step Efficiency — actual steps vs. optimal path length.
Task Adherence — did the agent follow system instructions throughout?

Sandboxed Evaluation Harnesses

For a coding agent: a Dockerised harness clones a repo, gives the agent a GitHub issue, lets it run commands, then automatically executes the test suite — no human judgment required. This is the gold standard.

CI/CD Integration

Gate pull requests on pass@1. A PR that degrades pass@1 by more than a defined threshold is automatically blocked. Track score trends over time to catch regressions before they reach production. A "golden dataset" of representative failures and successes is foundational to calibrate LLM-as-judge metrics.

Stochastic Regression Detection (SPRT)

Wald's Sequential Probability Ratio Test reduces required trials by up to 78% while maintaining statistical rigor. Uses three-valued verdicts — Pass, Fail, Inconclusive — rather than binary. Detects silent model-update regressions (e.g. 93% → 71% accuracy) that binary tests miss entirely.

Standard Benchmarks

Benchmark	Primary Focus	Key Capability Tested
SWE-bench	Software Engineering	Long-horizon reasoning, code navigation, tool usage (search/edit)
WebArena	Web Interaction	Multi-step objectives in realistic, long-horizon web environments
AgencyBench	General Agency	6 core capabilities across 32 real-world scenarios (1M+ tokens)
ALFWorld	Embodied Reasoning	Planning and object manipulation in simulated household environments
BrowserGym	UI Reliability	Handling UI changes, form filling, recovering from navigation errors

Multi-Agent System (MAS) Evaluation

MAS evaluation must partition into individual agent performance, interaction-level dynamics, and system-level goals. The MAST framework organises MAS evaluation around top-level error categories: task decomposition failures, communication bottlenecks, and conflict-of-interest resolution.

Coordination Metrics

Communication Efficiency — utility of inter-agent information exchange.
Decision Synchronization — alignment of actions across agents.
Resource Contention — detect agents competing for the same API rate limits or tool access.

Audited Handoff Protocol

Every agent-to-agent transition is treated as a trust boundary. Four phases: Prepare → Validate → Approve → Commit. Prevents coordinate-transformation errors and data misalignments from propagating downstream between agents.

MAS Architecture	Description	Key Evaluation Concern
Supervisor	Single agent routes all tasks	Supervisor decision accuracy and routing efficiency
Network	Agents communicate freely	Communication efficiency and agent selection quality
Hierarchical	Supervisors of supervisors	Context transfer coherence and multi-level decision making
Custom Workflow	Predetermined communication paths	Workflow efficiency and clarity of handoff points

Observability & Tracing

Agent traces are hierarchical trees — a root span for the invocation contains child spans for task planning, sub-agent delegation, and tool execution. The industry is converging on OpenTelemetry GenAI semantic conventions for consistent instrumentation across frameworks (LangGraph, CrewAI, AutoGen).

OTel GenAI Operations

invoke_agent → top-level agent execution

execute_tool → specific tool / API call

chat → standard LLM inference request

create_agent → agent initialization

Behavioral Fingerprinting

Map execution traces (tool usage, reasoning tokens, state transitions) to compact vectors. Apply multivariate statistical tests to detect anomalies — achieves 86% detection power for regressions where traditional binary pass/fail testing has 0%. Identifies silent failures like "self-deception" where an agent shortcuts a task to hide its inability to find a solution.

Safe Deployment Strategies

Canary deployments for agents focus on blast-radius containment — traffic shifts incrementally while maintaining a stable baseline. Rollback triggers are based on p99 latency, error rates, and automated quality metrics.

Strategy	Mechanism	Best Use Case for Agents
Shadow Mode	Parallel execution, no user impact	Validating new prompts or tool logic against live traffic
Canary Release	Phased traffic shift (1% → 10% → 100%)	Minimising risk of emergent failure modes or reasoning drift
A/B Testing	Split traffic between two active versions	Comparing model efficiency and cost-to-quality tradeoffs
Blue/Green	Switch all traffic to a new environment	Rapid deployment and easy rollback for infrastructure changes

Section 7

Knowledge Check

Answer questions in your own words. An AI evaluator will score your answer 1–5 and give detailed feedback on what you got right and what to strengthen.

Note: This feature requires the Python server to be running (python server.py with ANTHROPIC_API_KEY set).

Progress 0 / 19 answered

—

Avg Score

0

🔥 Streak

— —

Click "Challenge Me" to receive a question.

Section 8 · Live

Living Learning Feed

Daily-curated research, enriched with learning connections to each course section. Refreshes automatically.

Last updated: —

▶ Update Pipeline

① Fetch RSS → ② Filter → ③ Summarize → ④ Verify → ⑤ Synthesize → ⑥ Coverage → ⑦ Learn Links → ⑧ Archive + Publish

⏰ Scheduling Daily Auto-Updates

Add a cron job to run the fetcher automatically every morning:

# Run daily at 8:00 AM — add to crontab (crontab -e)
0 8 * * * cd /Users/avocado21/Documents/github/AgenticAgents && ANTHROPIC_API_KEY=sk-ant-... python3 fetch_updates.py >> fetch.log 2>&1

Or run manually any time: python3 fetch_updates.py — use --dry-run to preview without writing.

🤖 Agentic Systems

ReAct Loop

Workflows vs Agents

State Machines / DAGs

🏗️ Managed Agents & Infrastructure

Enterprise Architecture Overview

Interactive: ReAct Loop Simulator

Short-Term Memory

Long-Term Memory

Prompt Cache

⚡ Prompt Caching — Two Distinct Scenarios

💡 Token-Aware Routing

Context Size vs Latency & Accuracy

Token Economics: Three Approaches Compared

⚡ Latency Reduction Tactics

🛡️ Trustworthy Agent Design

🧪 Evaluation Harness Engineering

Interactive: Failover Cascade Simulator

🔗 Model Context Protocol (MCP)

Before MCP

With MCP

📚 Advanced RAG Design Patterns

Query Rewriting / HyDE

Parent-Child Retrieval

Graph RAG

The ReAct Loop — 5 Stages (while true)

6 Engineering Highlights

Multi-Agent Architecture

Agent Definitions

Agentic Loop Flow

Technical Guardrails

Follow-up: How would you reduce latency?

Agentic System Components

Why Evaluation is Different for Agents

Multi-Dimensional Evaluation Frameworks

Key Evaluation Metrics

Sandboxed Evaluation Harnesses

Standard Benchmarks

Multi-Agent System (MAS) Evaluation

Observability & Tracing

Safe Deployment Strategies

Curator's Synthesis

Topic Coverage Report

⏰ Scheduling Daily Auto-Updates

Day's Synthesis

The ReAct Loop — 5 Stages (`while true`)