Agent Orchestration & Frameworks
Orchestration is the central nervous system of an agentic application. It dictates how tasks are broken down, how tools are selected, and how agents collaborate.
ReAct Loop
Observe → Reason → Act → Observe. The LLM reasons about the state, calls a tool, then processes the result — cycling until the task is done.
Workflows vs Agents
Anthropic recommends predictable workflows (routing, chaining, parallelisation) for most tasks, reserving autonomous agents for truly open-ended problems.
State Machines / DAGs
Production standard. Represent workflows as directed graphs — nodes are agents/functions, edges are conditional routing logic. Enables pause, resume, and inspection.
🏗️ Managed Agents & Infrastructure
Moving from a local script to production requires a lifecycle manager, not just a deployed model.
State Persistence
Save the agent's full graph state to a database (e.g., Postgres) at every node transition so it survives server restarts and can be inspected.
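A minimal sketch of per-node checkpointing, assuming a single table keyed by run ID. SQLite from the standard library stands in for Postgres so the example runs anywhere; the table layout and the function names (`open_store`, `save_checkpoint`, `load_checkpoint`) are illustrative, not taken from any framework.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Create the checkpoint table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT PRIMARY KEY, node TEXT NOT NULL, state TEXT NOT NULL)"
    )
    return conn

def save_checkpoint(conn, run_id, node, state):
    """Upsert the latest node and serialised state for a run."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (run_id, node, state) VALUES (?, ?, ?)",
        (run_id, node, json.dumps(state)),
    )
    conn.commit()

def load_checkpoint(conn, run_id):
    """Return (node, state) for a run, or None if it never checkpointed."""
    row = conn.execute(
        "SELECT node, state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else None
```

Calling `save_checkpoint` at each node transition means a restarted worker can call `load_checkpoint` and continue from the last recorded node instead of starting the task over.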
Human-in-the-Loop (HITL)
Pause execution and route high-stakes actions to a human approval queue before resuming — e.g., "Can I execute this DROP TABLE command?"
Enterprise Architecture Overview
[Diagram: enterprise architecture. Node labels: small model, Postgres, approval gate, specialist, GitHub/Slack/…, long-term memory]
Memory & Context Management
Context is your most precious resource. How you fill it, trim it, and cache it determines your agent's cost, speed, and quality.
Short-Term Memory
The active context window — conversation history, recent tool outputs. Limited by token budget.
Optimise: sliding window truncation or LLM-based summarisation
Long-Term Memory
Episodic and semantic memory stored in vector databases, retrieved on demand via RAG.
Optimise: Parent-Child chunking, HyDE query rewriting
Prompt Cache
Anthropic and OpenAI both support KV-cache reuse for static prefixes — system prompts, rules, large documents.
Saves up to 90% on input token costs & cuts TTFT
⚡ Prompt Caching — Two Distinct Scenarios
The fundamental requirement for a cache hit is that a significant block of text at the very beginning of the prompt is exactly identical to a previous request. The value and implementation differ depending on who is sharing the prefix.
Scenario 1 · Multi-Turn (Same Session)
The most common agentic use case. As the conversation grows, the prompt becomes [System] + [Turn 1] + [AI 1] + [Turn 2]…. When Turn 2 is sent, the provider matches the leading components against what it already processed during Turn 1 and reuses that cached computation, so only the new suffix is processed from scratch.
Scenario 2 · Cross-User (Shared Global Cache)
Large-scale applications where thousands of users share the same "personality" or "knowledge base". If 10,000 users all start a "Dynamics 365 Finance Assistant" chat, they all share the same 5,000-token system message.
⚠️ Common Trap: Personalisation Breaks the Cache
For Scenario 2 to work, the shared prefix must be exactly identical across users. If you personalise the system prompt — even with something as small as You are helping [User Name] — you change the very first tokens, busting the cache for every user.
Bad: You are helping Alice Chen with Dynamics 365… (unique per user, no cache hit)
Good: You are a Dynamics 365 Finance Assistant… + user context appended after the shared prefix
💡 Token-Aware Routing
Use cheap, fast models (Claude Haiku, GPT-4o-mini) for intent classification and routing. Reserve expensive frontier models (Claude Sonnet, GPT-4o) only for complex reasoning steps that truly require it. This alone can cut inference costs by 60–80%.
Context Size vs Latency & Accuracy
Larger context windows improve recall but cost more time and money: input cost scales roughly linearly with tokens, and attention compute grows faster than linearly. RAG + prompt caching targets the sweet spot.
Token Economics: Three Approaches Compared
Reliability & Latency
Agents failing is expected and recoverable. Agents acting destructively is not. Both require deliberate architectural design.
⚡ Latency Reduction Tactics
Streaming (SSE)
Stream token-by-token. Users perceive Time-To-First-Token (TTFT), not total generation time. A fast TTFT with slow tail feels faster than batch delivery.
Parallel Tool Calling
LLM emits a JSON array of tool calls in one response. All tools execute concurrently. Only use when tools are independent — never if Tool B depends on Tool A's result.
Prompt Caching
Cache large static prefixes (system prompts, policy docs). Cached input tokens can be billed at up to roughly 10× less than uncached ones and skip prefill recomputation, which substantially reduces TTFT on repeated calls.
🛡️ Trustworthy Agent Design
Blast Radius
Define the maximum damage a misbehaving agent can cause. Scope write access via IAM roles — not just prompt instructions. A prompt that says "never delete" is not a security boundary.
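Constitutional AI
Encode safety constraints in the system prompt. (Strictly, Constitutional AI names a training-time alignment method; the term is used loosely here for prompt-level rules.) Useful, but insufficient alone: prompt-level constraints can be bypassed through prompt injection.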
API / Tool Level Enforcement
The tool implementation itself must validate and refuse dangerous operations. The LLM calls the tool; the tool decides whether to execute. This is the real safety layer.
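As a rough illustration of tool-level enforcement, the sketch below puts the refusal logic inside the tool rather than in the prompt. The deny-list, the `ToolRefused` exception, and the `run_sql_tool` name are all hypothetical, and the naive split on `;` would misread semicolons inside string literals, so treat it as a shape rather than a hardened validator. A real deployment would also back it with a read-only database role.

```python
import re

# Hypothetical deny-list of destructive SQL keywords.
FORBIDDEN = re.compile(r"\b(DROP|TRUNCATE|DELETE|ALTER|GRANT)\b", re.IGNORECASE)

class ToolRefused(Exception):
    """Raised when the tool declines to run a request."""

def run_sql_tool(query, execute):
    """Validate a query before handing it to `execute` (the real DB call)."""
    statements = [s for s in query.split(";") if s.strip()]
    if len(statements) != 1:
        raise ToolRefused("exactly one statement per call is allowed")
    if FORBIDDEN.search(statements[0]):
        raise ToolRefused("destructive statements are not permitted")
    return execute(statements[0].strip())
```

The model can ask for anything; whether the statement runs is decided by code the model cannot rewrite.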
🧪 Evaluation Harness Engineering
Critical for Production
You cannot manage what you cannot measure. The hardest part of an agentic system is not building the agent; it is building the evaluation harness that tells you whether it works.
Sandboxed Harness
To test a coding agent you must build a Dockerised harness that:
- Clones a fresh repository
- Injects a GitHub issue as the task
- Grants the agent its full tool set (Bash, Edit, Write)
- Automatically runs the test suite when the agent terminates
- Records pass / fail — no human judgment involved
The sandbox must mirror production as closely as possible (same OS, same dependencies, equivalent but non-production credentials), or your eval scores will diverge from real-world performance.
Deterministic Evals on a Non-Deterministic System
LLMs are stochastic. A single run tells you nothing. You need statistical confidence:
pass@1
Probability the agent succeeds on a single attempt. This is your primary production metric — what the user experiences.
pass@k
Probability of at least one success in k attempts. Useful for debugging — a low pass@1 but high pass@5 indicates the agent can solve the problem but struggles with consistency.
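Both metrics can be estimated from the same set of recorded runs. The sketch below uses the combinatorial estimator commonly applied to code-generation benchmarks: given n runs of a task with c passes, it computes the chance that a random sample of k runs contains at least one pass. The function name and argument names are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n recorded runs of one task, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the plain pass rate c / n. Averaging the per-task values over the golden dataset gives the suite-level score.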
Integrate the Harness into CI/CD
Track score trends over time to catch regressions before they reach production. A golden dataset of 50–200 representative tasks is usually sufficient to detect meaningful quality changes.
Failover Cascade
A production agent degrades gracefully when its primary model fails, stepping down through the tiers below.
Primary: Claude Sonnet (Anthropic)
Complex reasoning, 200k context, highest quality output
Fallback: GPT-4o (OpenAI)
Cross-provider redundancy eliminates single-vendor outage risk
Degraded: Llama-3 8B (local / Ollama)
No external dependency — always available. Limited capability but keeps service up.
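The three tiers above can be sketched as an ordered list of callables tried in sequence. The provider callables here are placeholders that a caller would supply; no real SDK is invoked, and `call_with_failover` and `AllProvidersFailed` are hypothetical names.

```python
class AllProvidersFailed(Exception):
    """Raised when every tier in the cascade has failed."""

def call_with_failover(prompt, providers):
    """Try each (name, callable) tier in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # timeouts, rate limits, outages
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the tier name alongside the answer lets the caller log when it is running in degraded mode.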
MCP & Advanced RAG
How your agent connects to the world determines its capabilities and its attack surface. Standardise your integrations and design your retrieval pipeline carefully.
🔗 Model Context Protocol (MCP)
Anthropic introduced MCP as an open standard to solve the fragmented tool-integration problem — every tool required a bespoke wrapper.
📚 Advanced RAG Design Patterns
Query Rewriting / HyDE
A vague user query like "how do agents handle memory?" embeds poorly. HyDE (Hypothetical Document Embeddings) first asks the LLM to write an ideal answer paragraph, then embeds that to search the vector DB. The hypothetical document is much closer in embedding space to the real answer.
Parent-Child Retrieval
Embed small, precise chunks (128 tokens) for high-accuracy similarity matching. When a chunk is retrieved, return its full parent document (1024+ tokens) to the LLM. Solves the tradeoff between embedding precision and answer completeness.
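A toy sketch of the parent-child idea. Word-overlap (Jaccard) similarity stands in for cosine similarity over real embeddings, and chunks are measured in words rather than tokens; both are simplifications chosen so the example runs without a model or vector database. All function names are illustrative.

```python
def tokenize(text):
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard overlap; a stand-in for cosine similarity over embeddings."""
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_index(parents, chunk_words=8):
    """Split each parent document into small child chunks that point back to it."""
    index = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), chunk_words):
            index.append((" ".join(words[i:i + chunk_words]), parent_id))
    return index

def retrieve_parent(query, index, parents):
    """Match on the child chunk, but return the full parent document."""
    _, parent_id = max(index, key=lambda item: similarity(query, item[0]))
    return parents[parent_id]
```

The search runs over the small chunks, where a match is sharp, while the model receives the whole parent, where the answer is complete.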
Graph RAG
Build a knowledge graph over your document corpus — entities as nodes, relationships as edges. Enables multi-hop reasoning: "How does X relate to Y?" Standard vector search returns similar documents; Graph RAG traverses connected concepts. Best for complex domains where relationships between entities matter.
Case Studies & Design Framework
Real-world architectures and a structured 6-pillar framework for designing any agentic system.
Core Philosophy
An AI coding assistant is not a chat window — it is a production-grade Agent Runtime. Every engineering decision (context compression, speculative execution, state management) optimises for reliability, cost, and latency at the same time, rather than trading one off against another.
Primary Risk Guarded Against
Silent failure at scale. A 7-layer recovery cascade (API backoff → overload handling → token recovery → context compression → context purging → persistent retry → emergency compaction) ensures the agent self-heals from network jitter, API overload, and context overflow rather than crashing silently.
The ReAct Loop — 5 Stages (while true)
[Diagram: five-stage loop. Stage captions: prune + compress history → SSE model call → streaming + batch executors → tasks · memory · diffs → or recover via 7-layer cascade]
6 Engineering Highlights
1 · Prompt Cache Segmentation
System prompt is split at a system_prompt_dynamic_boundary marker. The static half (role, tool rules, coding philosophy) is flagged for global cache sharing across all users. The dynamic half (memory, MCP instructions, environment) is never cached. Result: maximum cache-hit rate and up to 90% reduction in input-token cost.
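A minimal sketch of the segmentation idea, assuming the marker appears once as a literal string inside the prompt. The hash is only a way to show that two users share a byte-identical static half; real provider caches match on the token prefix, and the helper names here are hypothetical.

```python
import hashlib

BOUNDARY = "system_prompt_dynamic_boundary"  # marker name taken from the text

def build_prompt(static_rules, user_context):
    """Keep shared rules first and per-user context after the marker."""
    return f"{static_rules}\n{BOUNDARY}\n{user_context}"

def split_prompt(system_prompt):
    """Return (static_prefix, dynamic_suffix) split at the boundary marker."""
    static, _, dynamic = system_prompt.partition(BOUNDARY)
    return static, dynamic

def prefix_key(static_prefix):
    """Hash of the static half; identical keys imply a shareable prefix."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
```

Two prompts built from the same rules but different user context yield the same `prefix_key`, whereas moving a user's name into the static half changes the key for every user.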
2 · Four-Tier Context Compression
- Snip: lightweight trim before each API call.
- Micro-compact: cache-aware, time-based, or API-level compression.
- Auto-compact: AI summarisation when the token threshold is hit.
- Reactive compact: emergency compaction on a 413 overflow error, followed by intelligent restoration of recently accessed files.
3 · Speculative Execution
Tools begin executing in a copy-on-write overlay filesystem before the user confirms. If confirmed, the overlay is merged to disk; if rejected, the overlay is discarded — the real filesystem is untouched. Suggestions are pipelined: the next action starts speculatively while the user reviews the current one, mirroring CPU instruction pipelines to mask confirmation latency.
4 · 20-Check Command Security
Every shell command passes 20 security checks before execution: JQ-injection detection, newline injection, command substitution patterns, IFS injection, Unicode whitespace masquerading, token-theft attempts, and more. In autonomous mode an interpreter blacklist blocks Python / Node / Ruby / Perl / PHP from running without explicit user confirmation.
5 · Zustand-Style State Store
A custom lightweight state store — inspired by Zustand but built for terminal React Ink rendering — holds 100+ global properties (settings, task queues, tool configs, permissions, MCP status, speculative execution state). Object-identity comparison and selector subscriptions ensure re-renders fire only when the subscribed field actually changes, preventing cascading repaints in the terminal UI.
6 · Worker System (6 types · 24 events)
- Command: shell execution
- Prompt: LLM review
- Agent: full multi-turn session
- HTTP: external endpoints
- Callback: internal TS functions
- Function: boolean checks
24 event types span pre/post tool execution, API requests, conversation lifecycle, compression triggers, and user input — letting enterprise teams customise behaviour (e.g. auto-log every Bash call, security-review before writes) without touching core source code.
Multi-Agent Architecture
Fork Agent
Child inherits parent's full context, runs in an independent process branch.
In-Process Agent
Same process, AsyncLocalStorage for context isolation — lower overhead.
Split-Pane Agent
Leader + Teammate rendered side-by-side in a Tmux split — visible parallelism.
Design Walkthrough
"Design a multi-agent system to automate month-end bank reconciliation. Input: unstructured bank statements + structured GL data. Output: a reconciled ledger and a full audit trail that explains every automated decision."
This is a classic High-Precision agentic workflow. In a financial context, LLMs handle semantic reasoning; deterministic tools handle all computation and writes.
Core Philosophy
Rules clear the easy 80%; agents handle the noisy 20%. The system transitions through three roles: Pattern Matcher → Context Hunter → Bookkeeper. LLMs reason, but every number is produced by a deterministic tool — never by the model itself.
Primary Risk Guarded Against
Hallucinated math and unauditable decisions. A Calculation_Tool owns all arithmetic. Every tool call is written to a structured JSON trace that feeds a human-readable PDF audit report — no black-box reasoning makes it into the GL.
Agent Definitions
A · Verification Agent
The Auditor
Entry point. Joins bank statement to GL, clears exact matches, flags the 20% of noisy discrepancies.
Skills
- Fuzzy matching (near-value transactions)
- Entity resolution ("MSFT *REDMOND" → "Microsoft Corp")
Tools
- SQL_Query_GL
- Vector_Search_Vendors (RAG)
- Discrepancy_Logger
B · Researcher Agent
The Investigator
Most "agentic" part. Triggered per investigation ticket — infers cause from emails, PDFs, and bank memos.
Skills
- Contextual inference (e.g. SWIFT fees)
- Unstructured data synthesis
- Grouped payment detection
Tools
- Email_RAG_Tool
- Document_Parser (OCR)
- Bank_API_Interface
C · Resolution Agent
The Bookkeeper
Highly constrained. Prepares — never auto-posts — journal entries; generates the audit trail.
Skills
- Double-entry logic (debits must balance credits)
- Compliance mapping (reason codes)
Tools
- D365_Journal_Draft_Creator
- Audit_Trail_Generator
Agentic Loop Flow
[Flow diagram] Bank ⋈ GL join, clear exact matches → create investigation ticket + delta hash → Researcher reasons over ticket → < 90% confidence: flag human → stage D365 draft + generate audit PDF
Technical Guardrails
Deterministic Math
Never let the LLM do arithmetic. A Calculation_Tool accepts (val1, val2) and returns val1 - val2. The model passes the operands; the tool owns the result. Prevents hallucinated subtraction that would corrupt the GL.
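A minimal sketch of such a tool, assuming the model passes operands as strings. `Decimal` is used instead of floats so currency subtraction is exact; the lowercase function name is an illustrative stand-in for Calculation_Tool.

```python
from decimal import Decimal

def calculation_tool(val1, val2):
    """Return val1 - val2 exactly; operands arrive as strings from the model."""
    return Decimal(val1) - Decimal(val2)
```

For example, `calculation_tool("0.3", "0.1")` equals `Decimal("0.2")`, whereas the float expression `0.3 - 0.1` does not equal `0.2`. A malformed operand raises `decimal.InvalidOperation` rather than producing a guess.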
Human-in-the-Loop Trigger
A confidence score gates the Resolution Agent. If the Researcher finds multiple plausible explanations (e.g. three emails that could each explain a fee), confidence drops below 90% and the transaction is routed to a human reviewer rather than auto-resolved.
Auditable Reasoning Trace
Every tool call is logged to a structured JSON chain:
[VerificationAgent: $5 delta] → [ResearcherAgent: Email_Tool('Inv-505') → "Service Fee"] → [ResolutionAgent: mapped GL 60500]
This trace feeds the Audit_Trail_Generator PDF, required for SOX compliance.
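One possible shape for that trace, using the agents and values from the example chain. The field names (`step`, `agent`, `tool`, `args`, `result`) and the `log_step` helper are assumptions for illustration; the text does not specify a schema.

```python
import json

def log_step(trace, agent, tool, args, result):
    """Append one tool call to the trace and return the new entry."""
    entry = {"step": len(trace) + 1, "agent": agent, "tool": tool,
             "args": args, "result": result}
    trace.append(entry)
    return entry

trace = []
log_step(trace, "VerificationAgent", "SQL_Query_GL",
         {"invoice": "Inv-505"}, {"delta": "5.00"})
log_step(trace, "ResearcherAgent", "Email_RAG_Tool",
         {"query": "Inv-505"}, {"reason": "Service Fee"})
log_step(trace, "ResolutionAgent", "D365_Journal_Draft_Creator",
         {"gl_account": "60500"}, {"status": "draft"})

# Serialised form that a report generator could consume.
trace_json = json.dumps(trace, indent=2)
```

Because each entry records the agent, the tool, and the exact arguments, a reviewer can replay why a given GL account was chosen without reading model output.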
IAM — Least Privilege
Each agent runs under a separate Service Principal. Verification Agent: read-only GL. Researcher Agent: read-only email + document storage. Resolution Agent: the only principal with Write access to the ERP. No agent can escalate its own permissions.
Follow-up: How would you reduce latency?
Fan-Out / Map-Reduce
Spawn N Researcher Agent instances in parallel — one per investigation ticket — instead of processing sequentially. A central Aggregator Agent deduplicates findings so two researchers don't claim the same "found money."
Async Parallel Tool Execution
When a Researcher needs both Bank_API and Email_RAG, fire both simultaneously with asyncio.gather() or a task queue (Celery / Temporal). The agent pauses its state, waits for all results, then resumes. Total I/O wait drops from the sum of the two calls to roughly the slower one, about half when they take similar time.
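A small sketch of the `asyncio.gather()` pattern. The two coroutines are simulated stand-ins for Bank_API and Email_RAG, with `asyncio.sleep` playing the role of network latency; their names and return values are invented for the example.

```python
import asyncio
import time

async def bank_api(ticket_id):
    await asyncio.sleep(0.2)  # simulated network latency
    return {"ticket": ticket_id, "memo": "SWIFT fee"}

async def email_rag(ticket_id):
    await asyncio.sleep(0.2)  # simulated retrieval latency
    return {"ticket": ticket_id, "emails": 3}

async def gather_evidence(ticket_id):
    """Run both independent lookups concurrently and wait for both."""
    return await asyncio.gather(bank_api(ticket_id), email_rag(ticket_id))

start = time.perf_counter()
bank, emails = asyncio.run(gather_evidence("T-1"))
elapsed = time.perf_counter() - start  # close to one sleep, not two
```

`asyncio.gather` returns results in the order the awaitables were passed, so the unpacking is safe regardless of which call finishes first.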
Speculative Execution
While the Researcher investigates, the Resolution Agent pre-stages the two most likely journal entry drafts ("FX Loss" and "Bank Fee"). Once the Researcher returns a verdict, the correct draft is committed immediately — the resolution step is already done.
Message Broker at Scale
For 10,000+ month-end transactions: a Kafka/RabbitMQ queue holds investigation tasks; a worker pool of Researcher Agents pulls from it. A NoSQL result store (CosmosDB) checkpoints intermediate reasoning so a crashed agent can be resumed — not restarted — by another worker.
Reasoning Cache (Semantic KV)
If a $1.50 delta for "Vendor X" has already been resolved, cache the reasoning result. The next identical discrepancy skips the Researcher entirely and goes straight to Resolution — one LLM call instead of three.
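A deliberately simplified sketch of the cache: it matches on an exact (vendor, delta) key rather than on semantic similarity, which a fuller version might add with embeddings. `ReasoningCache`, `resolve`, and the `researcher` callable are hypothetical names.

```python
from decimal import Decimal

class ReasoningCache:
    """Exact-match cache keyed on (vendor, delta); a simplification of a semantic cache."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(vendor, delta):
        # Normalise case and whitespace; Decimal makes "1.50" and "1.5" equal.
        return (vendor.strip().lower(), Decimal(delta))

    def get(self, vendor, delta):
        return self._store.get(self._key(vendor, delta))

    def put(self, vendor, delta, resolution):
        self._store[self._key(vendor, delta)] = resolution

def resolve(vendor, delta, cache, researcher):
    """Use the cached reasoning if present; otherwise call the researcher once."""
    cached = cache.get(vendor, delta)
    if cached is not None:
        return cached
    result = researcher(vendor, delta)
    cache.put(vendor, delta, result)
    return result
```

In a financial setting the cached entry would likely also need an expiry and a link to the original audit trace, so a reused explanation remains reviewable.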
Agentic System Components
The six core pillars every production agentic system must address.
1. Scope & Blast Radius
Start by asking: is this agent read-only or does it have write capabilities? If it writes — to a database, filesystem, or external API — you must immediately define the blast radius: what is the worst-case action it could take, and how do you contain it?
Apply the principle of least privilege via IAM roles — never give the agent broader access than a single task requires. For irreversible or high-impact actions (e.g. sending an email, deleting a record, executing a trade), insert a Human-in-the-Loop (HITL) approval gate that pauses execution and routes to a human before proceeding. Design actions to be reversible wherever possible — prefer soft deletes, staged commits, and dry-run modes.
2. State Machine / DAG
Model your agent as a directed acyclic graph (DAG): Entry Point → Router → Specialist Agents → Output. The Router classifies the user's intent and dispatches to the appropriate specialist (e.g. a retrieval agent, a code agent, a summarisation agent). Edges represent conditional routing logic — an agent's output determines which node runs next.
This is a Managed Agent architecture: at every node transition, serialise the full graph state to a persistent store (Postgres, Redis). This makes the agent interruptible and inspectable — if the server restarts mid-task, it can resume from the last saved checkpoint rather than starting over. It also enables HITL pauses: the agent suspends at a node, waits for human approval, then resumes exactly where it left off.
3. Data Ingestion (RAG / MCP)
Define how the agent securely connects to data. Use the Model Context Protocol (MCP) as a standardised client-server interface — instead of hardcoding API wrappers, run MCP Servers (GitHub MCP, Slack MCP, database MCP) that the agent can call uniformly. This decouples agent logic from data sources and improves security by keeping credentials server-side.
For unstructured knowledge retrieval, choose your RAG pattern based on the query type: HyDE rewrites vague queries into hypothetical ideal documents before hitting the vector DB; Parent-Child embeds small precise chunks for high-accuracy retrieval but returns the larger parent document for full context; Graph RAG builds a knowledge graph to support multi-hop reasoning across connected entities. Define your chunk size (typically 256–512 tokens), embedding model, and similarity threshold explicitly.
4. The Agent Loop
Describe the specific Reason-Act cycle your agent runs. The ReAct pattern: Observe current state → Reason inside a <thinking> block (inner monologue, not shown to user) → Act by emitting a tool call → Observe the tool result → repeat. The <thinking> phase is critical — it forces the model to plan before acting, dramatically reducing impulsive or incorrect tool calls.
Define what counts as one "step" (typically one tool call + observation), and set a hard cap on steps before forcing human review — a common default is 10–15 steps. Beyond that threshold, the agent should surface its current progress to a human rather than continuing autonomously, preventing runaway loops. Also define your termination conditions: what constitutes task completion vs. task failure?
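The step cap and termination conditions can be sketched as a bounded loop. Here `policy` is a plain function standing in for the LLM call, and the action dictionary format (`type`, `tool`, `args`, `answer`) is an assumption made for the example rather than any framework's schema.

```python
def run_agent(task, policy, tools, max_steps=10):
    """Run a Reason-Act loop until the policy finishes or the step cap is hit."""
    observations = []
    while len(observations) < max_steps:
        action = policy(task, observations)  # reason: choose the next action
        if action["type"] == "finish":
            return {"status": "done", "answer": action["answer"],
                    "steps": len(observations)}
        result = tools[action["tool"]](**action["args"])  # act
        observations.append({"tool": action["tool"], "result": result})  # observe
    # Cap reached: surface progress to a human instead of continuing.
    return {"status": "needs_human_review", "steps": len(observations),
            "observations": observations}
```

One step here is one tool call plus its observation, matching the definition above. A policy that never emits a finish action ends in `needs_human_review` after `max_steps` tool calls instead of looping indefinitely.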
5. Evaluation Harness
You cannot manage what you cannot measure. Before shipping any agent update, run it against a golden dataset of representative tasks through a sandboxed evaluation harness. For a coding agent, this means a Dockerised environment that clones a repo, gives the agent a GitHub issue, lets it run commands, and then automatically executes the test suite to score the result — no human judgment required.
Because LLMs are non-deterministic, run each task multiple times and compute pass@1 (probability of success on a single attempt) and pass@k (at least one success in k attempts). Integrate this harness into CI/CD — a pull request that degrades pass@1 by more than a threshold should be blocked automatically. Track score trends over time to catch regressions before they reach production.
6. Bottlenecks (Latency / Cost / Reliability)
Latency: Always stream responses via SSE so the user sees the first token as soon as it is generated rather than waiting for the full response. Use parallel tool calling: instead of executing tools sequentially, have the model emit a JSON array of tool calls so independent tools run concurrently, reducing I/O wait from the sum of the calls to roughly the slowest one.
Cost: Implement prompt caching on large, stable inputs — system prompts, rule sets, retrieved documents — reducing input token costs by up to 90%. Use token-aware model routing: cheap small models (Haiku, GPT-4o-mini) for simple classification and routing decisions; expensive large models (Sonnet, GPT-4o) only for complex multi-step reasoning.
Reliability: Implement a failover cascade. If the primary model times out, hits a rate limit, or triggers a safety filter incorrectly, automatically fall back to a secondary provider within the same request. Log every tool call, input, and output for post-hoc debugging. Set circuit breakers on external tool calls so a single flaky API can't hang the entire agent loop.
Agentic System Evaluation
You cannot manage what you cannot measure. Evaluation science for agentic systems requires frameworks purpose-built for non-determinism, multi-agent coordination, and production observability.
Why Evaluation is Different for Agents
Traditional software operates within deterministic bounds. Agents introduce non-determinism — the same prompt can yield different tool selections, reasoning chains, and outcomes across runs. Agent success rates on complex tasks can drop from 60% to 25% when tested for consistency, a failure mode invisible to single-turn testing.
| Traditional Observability | Agentic Observability |
|---|---|
| Focuses on infrastructure (CPU, Memory, Latency) | Focuses on reasoning loops, tool calls, and trajectories |
| Deterministic paths with reproducible execution | Non-deterministic paths with stochastic deviation |
| Failure signaled by error codes and timeouts | Failure signaled by degraded quality or hallucination |
| Metrics: Uptime, Throughput, Error Rate | Metrics: Task Adherence, Tool Selection Quality, Autonomy Index |
Multi-Dimensional Evaluation Frameworks
CLASSic Framework
Five core dimensions for enterprise agentic evaluation:
- Cost — token spend and infrastructure cost per task
- Latency — time-to-completion and TTFT
- Accuracy — task success and reasoning fidelity
- Stability — consistency across N runs
- Security — blast radius containment and policy compliance
Four-Pillar Breakdown
Partition evaluation to isolate failure origin:
- LLMs — foundation model reasoning quality
- Memory — retrieval accuracy and context management
- Tools — selection quality and output utilization
- Environment — API reliability and external system behavior
Domain-specific agents achieve ~82.7% accuracy vs. 59–63% for general LLMs.
Key Evaluation Metrics
pass@1 / pass@k
pass@1 — probability of success on a single attempt. pass@k — probability of at least one success in k attempts. Run each task multiple times; LLMs are non-deterministic. A single manual check is statistically meaningless.
Autonomy Index (AIₓ)
Proportion of task steps executed without human intervention:
AIₓ = 1 − (Human Interventions / Total Steps)
Primary ROI signal for agentic deployments.
Process Metrics
Tool Selection Quality — did the agent pick the right tool with correct params?
Step Efficiency — actual steps vs. optimal path length.
Task Adherence — did the agent follow system instructions throughout?
Sandboxed Evaluation Harnesses
For a coding agent: a Dockerised harness clones a repo, gives the agent a GitHub issue, lets it run commands, then automatically executes the test suite — no human judgment required. This is the gold standard.
CI/CD Integration
Gate pull requests on pass@1. A PR that degrades pass@1 by more than a defined threshold is automatically blocked. Track score trends over time to catch regressions before they reach production. A "golden dataset" of representative failures and successes is foundational to calibrate LLM-as-judge metrics.
Stochastic Regression Detection (SPRT)
Wald's Sequential Probability Ratio Test reduces required trials by up to 78% while maintaining statistical rigor. Uses three-valued verdicts — Pass, Fail, Inconclusive — rather than binary. Detects silent model-update regressions (e.g. 93% → 71% accuracy) that binary tests miss entirely.
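A compact sketch of Wald's test for pass/fail outcomes. The two hypothesised pass rates default to the 93% and 71% figures from the example above, while the error rates `alpha` and `beta` are assumed values chosen for illustration. The function returns the three-valued verdict together with the number of trials consumed.

```python
from math import log

def sprt(outcomes, p0=0.93, p1=0.71, alpha=0.05, beta=0.05):
    """Wald's SPRT. H0: pass rate is p0 (healthy); H1: pass rate is p1 (regressed)."""
    upper = log((1 - beta) / alpha)  # crossing it accepts H1
    lower = log(beta / (1 - alpha))  # crossing it accepts H0
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    for n, passed in enumerate(outcomes, start=1):
        llr += log(p1 / p0) if passed else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "Fail", n
        if llr <= lower:
            return "Pass", n
    return "Inconclusive", len(outcomes)
```

The test stops as soon as the evidence is strong enough in either direction, which is where the savings in trials come from. When the supplied outcomes run out first, the verdict is Inconclusive instead of a forced pass or fail.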
Standard Benchmarks
| Benchmark | Primary Focus | Key Capability Tested |
|---|---|---|
| SWE-bench | Software Engineering | Long-horizon reasoning, code navigation, tool usage (search/edit) |
| WebArena | Web Interaction | Multi-step objectives in realistic, long-horizon web environments |
| AgencyBench | General Agency | 6 core capabilities across 32 real-world scenarios (1M+ tokens) |
| ALFWorld | Embodied Reasoning | Planning and object manipulation in simulated household environments |
| BrowserGym | UI Reliability | Handling UI changes, form filling, recovering from navigation errors |
Multi-Agent System (MAS) Evaluation
MAS evaluation must partition into individual agent performance, interaction-level dynamics, and system-level goals. The MAST framework organises MAS evaluation around top-level error categories: task decomposition failures, communication bottlenecks, and conflict-of-interest resolution.
Coordination Metrics
Communication Efficiency — utility of inter-agent information exchange.
Decision Synchronization — alignment of actions across agents.
Resource Contention — detect agents competing for the same API rate limits or tool access.
Audited Handoff Protocol
Every agent-to-agent transition is treated as a trust boundary. Four phases: Prepare → Validate → Approve → Commit. Prevents coordinate-transformation errors and data misalignments from propagating downstream between agents.
| MAS Architecture | Description | Key Evaluation Concern |
|---|---|---|
| Supervisor | Single agent routes all tasks | Supervisor decision accuracy and routing efficiency |
| Network | Agents communicate freely | Communication efficiency and agent selection quality |
| Hierarchical | Supervisors of supervisors | Context transfer coherence and multi-level decision making |
| Custom Workflow | Predetermined communication paths | Workflow efficiency and clarity of handoff points |
Observability & Tracing
Agent traces are hierarchical trees — a root span for the invocation contains child spans for task planning, sub-agent delegation, and tool execution. The industry is converging on OpenTelemetry GenAI semantic conventions for consistent instrumentation across frameworks (LangGraph, CrewAI, AutoGen).
Behavioral Fingerprinting
Map execution traces (tool usage, reasoning tokens, state transitions) to compact vectors. Apply multivariate statistical tests to detect anomalies — achieves 86% detection power for regressions where traditional binary pass/fail testing has 0%. Identifies silent failures like "self-deception" where an agent shortcuts a task to hide its inability to find a solution.
Safe Deployment Strategies
Canary deployments for agents focus on blast-radius containment — traffic shifts incrementally while maintaining a stable baseline. Rollback triggers are based on p99 latency, error rates, and automated quality metrics.
| Strategy | Mechanism | Best Use Case for Agents |
|---|---|---|
| Shadow Mode | Parallel execution, no user impact | Validating new prompts or tool logic against live traffic |
| Canary Release | Phased traffic shift (1% → 10% → 100%) | Minimising risk of emergent failure modes or reasoning drift |
| A/B Testing | Split traffic between two active versions | Comparing model efficiency and cost-to-quality tradeoffs |
| Blue/Green | Switch all traffic to a new environment | Rapid deployment and easy rollback for infrastructure changes |
Living Learning Feed
Daily-curated research, enriched with learning connections to each course section. Refreshes automatically.
Requires server running with ANTHROPIC_API_KEY set.
⏰ Scheduling Daily Auto-Updates
Add a cron job to run the fetcher automatically every morning:
# Run daily at 8:00 AM — add to crontab (crontab -e)
0 8 * * * cd /Users/avocado21/Documents/github/AgenticAgents && ANTHROPIC_API_KEY=sk-ant-... python3 fetch_updates.py >> fetch.log 2>&1
Or run manually any time: python3 fetch_updates.py — use --dry-run to preview without writing.