Building Scalable AI Agents: A Journey Through Multi-Agent Architecture
Or: How I Learned to Stop Worrying and Love the Token Budget
Introduction
Remember when you thought building an AI agent would be easy? “Just throw some prompts at GPT-4 and call it a day,” you said.
“What could go wrong?” you said.
Narrator: Everything went wrong.
Large Language Models have revolutionized intelligent applications, but here’s what nobody tells you at those fancy AI conferences: scaling an AI agent from “cool demo” to “production system that doesn’t bankrupt your company” is… challenging. Token costs spiral like a startup’s cloud bill, context windows overflow faster than your coffee cup on Monday morning, and response times make dial-up internet look speedy.
This is our journey (currently in progress, bugs included) from a basic implementation to building a production-ready, cost-efficient multi-agent system for QuestionPro’s BI platform. We’re exploring three key patterns, making mistakes in real-time, and documenting everything so you don’t have to. Think of this as “Mythbusters” but for AI architecture, with 100% more token optimization and slightly fewer explosions.
Part 1: The Basic Agent - Or “How Hard Could It Be?”
The Innocent Beginning
Like every developer who’s just discovered LangGraph, we started with the “obvious” approach, THE MONOLITHIC AGENT: one agent to rule them all, one agent to find them, one agent to bring them all, and in the darkness bind them (to a $10,000/month OpenAI bill… roughly).

// Single agent with all capabilities
const agent = new ChatOpenAI({ model: "gpt-4" }).bindTools([
  createDashboard,
  getDashboard,
  createWidget,
  listSurveys,
  getQuestions,
  // ... 40+ more tools
]);
const systemPrompt = `
You are a dashboard analytics assistant.
Available dashboards: [... 500 lines of context ...]
Survey schemas: [... 2000 lines of schemas ...]
Widget configurations: [... 1500 lines of configs ...]
`;
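To make the damage concrete, here’s the napkin math as code. The token counts and the $10-per-million rate are illustrative assumptions, not real billing data:

```typescript
// Rough per-request cost model for the monolithic agent.
// All numbers are illustrative estimates, not measured billing data.
const TOKENS = {
  systemPrompt: 8_000, // dashboards + schemas + configs, inlined every time
  toolSchemas: 4_000,  // 40+ tool definitions, resent on every request
  conversation: 3_000, // accumulated message history
};

// Hypothetical blended rate: $10 per 1M tokens.
const DOLLARS_PER_TOKEN = 10 / 1_000_000;

function tokensPerRequest(t: typeof TOKENS): number {
  return t.systemPrompt + t.toolSchemas + t.conversation;
}

function costPerRequest(t: typeof TOKENS): number {
  return tokensPerRequest(t) * DOLLARS_PER_TOKEN;
}

console.log(tokensPerRequest(TOKENS));          // 15000 — our ~15k average
console.log(costPerRequest(TOKENS).toFixed(2)); // "0.15" per request
```

Notice that the system prompt and tool schemas dominate: three quarters of every request is fixed overhead the user never sees.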
The “Oh No” Moment
You know that feeling when you check your AI bill and your heart skips a beat? That was us, thinking about going into “production”.
Our “Everything is Fine” Metrics:
📊 The Uncomfortable Truth:
- Avg tokens per request: 15,000 (narrator: this is not good)
- Cost per conversation: $1.50 (multiply by users... *sweats*)
- P95 latency: 8 seconds (users hate us)
- Projected monthly costs: $4,500 (CFO hates us more)
- Success rate: 89% (11% of the time, it works every time!)
🔴 The "We Need to Talk" Problems:
- Context overflow (GPT-4 politely hanging up on us)
- Tool selection paralysis (like Netflix, which movie to watch)
- Exponential cost growth
- LLM having an existential crisis (overwhelmed with 40+ tools)
The Exponential Growth Problem (Or: Math is Unforgiving)
Turn 1: 12,500 tokens ✅ "This is fine"
Turn 5: 22,000 tokens ⚠️ "This is less fine"
Turn 10: 37,000 tokens ❌ "This is fire, everything is fire"
We were literally pricing ourselves out of business, one helpful conversation at a time.
The real kicker? We kept asking “Can we just add more features?” while remaining blissfully unaware of the bill each new feature would generate.
Note: These numbers are rough estimates from our experiments and paper-napkin calculations. We’re still actively developing and iterating, but they should point your thinking in the right direction.
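For the pedants (hi, welcome): the growth is technically quadratic in total tokens processed, because every turn re-sends the entire history. A sketch with napkin-math numbers of our own choosing:

```typescript
// The "exponential" feeling is really quadratic total cost: every turn
// re-sends the entire history. Numbers are napkin-math estimates.
const FIXED_CONTEXT = 12_000;      // system prompt + tool schemas, every turn
const TOKENS_PER_EXCHANGE = 2_500; // one user message + one reply, roughly

// Tokens the model reads on turn n (1-indexed).
function tokensAtTurn(n: number): number {
  return FIXED_CONTEXT + (n - 1) * TOKENS_PER_EXCHANGE;
}

// Total tokens processed (and billed) across an N-turn conversation.
function conversationTokens(turns: number): number {
  let total = 0;
  for (let n = 1; n <= turns; n++) total += tokensAtTurn(n);
  return total;
}

console.log(tokensAtTurn(1));        // 12000 — "this is fine"
console.log(tokensAtTurn(10));       // 34500 — same ballpark as our turn-10 fire
console.log(conversationTokens(10)); // 232500 tokens billed for one 10-turn chat
```

Quadratic, exponential — either way, the invoice does not care about your asymptotic vocabulary.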
Part 2: The Skills Pattern - Progressive Disclosure
The “Aha!” Moment
Or: How We Discovered That Laziness is Actually a Virtue
After three existential crises where we admitted our agent was hemorrhaging money, we stumbled upon a life-changing concept:
“What if… and hear me out… we DON’T load everything into the prompt like we’re packing for a 2-week vacation?”
Revolutionary, I know. We felt like we’d discovered fire, except fire was actually “reading the documentation properly.”
Enter progressive disclosure - a fancy term for “only fetch stuff when you actually need it,” which is basically how every efficient human operates but somehow felt groundbreaking when applied to AI. (Yes, we’re aware of the irony.)
“Don’t load everything upfront. Load only what you need, when you need it.”
Instead of including all context in every request, the agent loads specialized knowledge on demand through the skills pattern.

Implementation
// Lightweight system prompt with skill metadata only
const systemPrompt = `
You are a dashboard analytics assistant.
Available Skills (load when needed):
- survey_schema: Get survey questions and field types
- widget_config: Get chart-specific settings
- dashboard_filters: Get filter options
- styling_themes: Get appearance options
Current context:
- Workspace: 1
- Dashboard: ${state.dashboardId || "none"}
`;
// Skills load on-demand (LangChain's `tool` helper takes the function
// plus a name/description config)
const loadSkillTool = tool(
  async ({ skillName, resourceId }) => {
    switch (skillName) {
      case "survey_schema":
        // Returns ~2,000 tokens only when needed
        return await fetchSurveySchema(resourceId);
      case "widget_config":
        // Returns ~1,500 tokens only when needed
        return await fetchWidgetConfig(resourceId);
      default:
        throw new Error(`Unknown skill: ${skillName}`);
    }
  },
  { name: "load_skill", description: "Load specialized context on demand" }
);
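One refinement worth sketching: cache loaded skills per conversation so the agent never pays for the same heavy context twice. The fetchers below are stubs standing in for our real API calls — this is a sketch, not our production loader:

```typescript
// Sketch: cache skill payloads per conversation so repeated load_skill
// calls don't re-fetch (or re-bill) the same heavy context.
// The fetchers are stubs standing in for our real API calls.
type SkillFetcher = (resourceId: string) => Promise<string>;

const fetchers: Record<string, SkillFetcher> = {
  survey_schema: async (id) => `schema for survey ${id}`, // ~2,000 tokens IRL
  widget_config: async (id) => `config for widget ${id}`, // ~1,500 tokens IRL
};

const cache = new Map<string, string>();

async function loadSkill(skillName: string, resourceId: string): Promise<string> {
  const key = `${skillName}:${resourceId}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // second load costs nothing
  const fetcher = fetchers[skillName];
  if (!fetcher) throw new Error(`Unknown skill: ${skillName}`);
  const payload = await fetcher(resourceId);
  cache.set(key, payload);
  return payload;
}
```

A token budget cap on each payload is the natural next step — some of our skills turned out to be heavier than we’d like (foreshadowing).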
Skills in Action
User: “Create a gender chart from Customer Survey”
Request 1: Agent sees lightweight prompt
├─ System: 1,000 tokens (was 8,000)
├─ Tools: 1,500 tokens
└─ Messages: 2,000 tokens
Total: 4,500 tokens ✅
Agent decides: "I need the survey schema"
└─ Calls: load_skill("survey_schema", 12345)
Request 2: Agent receives survey details
├─ Previous context: 4,500 tokens
├─ Survey schema: 2,000 tokens
└─ Total: 6,500 tokens ✅
Agent creates chart with correct data mapping
The Results (Or: When Theory Meets Reality and Actually Works)
📊 Skills Pattern Metrics (We Were Shocked Too):
- Avg tokens: 6,500 (was 15,000) 🎉
- Cost per conversation: $0.65 (was $1.50) 💰
- P95 latency: 4s (was 8s) ⚡
- Monthly costs: $1,950 (was $4,500) 🎊
✅ 57% cost reduction (CFO can sleep better)
✅ 50% faster responses (users will hate us a little less now)
✅ Stable token usage (no more exponential nightmares)
✅ Team morale: Significantly improved
As this was a late-night hustle with no one else around, I just high-fived my wall.
But Wait... (The Plot Thickens)
Just when we thought we'd solved AI, reality decided to humble us. Again.
❌ The "Not So Fast" Problems:
1. Agent handling 40+ tools → Like asking someone to pick
their favorite child, but with 40 children
2. Mixed concerns → Debugging became "which of these
40 things broke?"
3. No parallelization → Everything sequential because
apparently we hate speed
4. Some skills returned 3,000+ tokens → The token diet
didn't last long
The Comedy of Errors:
User: "Style my dashboard in dark mode."
(A simple, reasonable request.)
Agent's internal monologue:
"Hmm, I have 40 tools. Let me read each description...
create_dashboard? No...
update_dashboard? Maybe?
create_widget? Probably not...
update_theme? OH WAIT THAT'S THE ONE! ✅
...only took me 3 seconds to figure that out"
User: _has already rage-quit_
Part 3: Multi-Agent Architecture - Specialized Experts
The “Why Didn’t We Think of This Sooner?” Moment
The Breakthrough (Cue Dramatic Music)
Picture this: It’s 2 AM, you’re on your third coffee, scrolling through LangGraph documentation for the millionth time, when suddenly…
“What if… what if… we don’t make one agent do EVERYTHING? What if we have specialized agents? Like… real companies?”
🤯 Mind. Blown.
We felt like we’d invented the wheel, despite the fact that humans have been organizing into specialized roles since, like, the dawn of civilization. But hey, better late than never!
Our New Squad:
- Dashboard Agent: The organized one who actually reads the manual
- Widget Agent: The creative type, probably went to art school
- Datasource Agent: The data nerd (affectionately), speaks fluent SQL
- Styling Agent: Fashion police of the digital world
Basically, we went from having one overworked, stressed-out agent having a breakdown, to a healthy work environment with proper delegation. Revolutionary.
Like a company with departments (Sales, Engineering, Design), we created specialized agents:
- Dashboard Agent: Dashboards and tabs expert
- Widget Agent: Visualizations expert
- Datasource Agent: Data and surveys expert
- Styling Agent: Themes and appearances expert

Federal Router Implementation
// Lightweight orchestrator (routing later moved to GPT-3.5 —
// classification doesn't need the expensive model)
const federalAgent = new ChatOpenAI({ model: "gpt-4" }).bindTools([
  routeToDashboardAgent,
  routeToWidgetAgent,
  routeToDatasourceAgent,
  routeToStylingAgent,
]);
const federalPrompt = `
You are a routing assistant. Delegate to specialists:
- Dashboard operations → Dashboard Agent
- Widget creation/editing → Widget Agent
- Data selection → Datasource Agent
- Styling/themes → Styling Agent
You route and synthesize - you don't execute tasks.
`;
Specialized Agent Example
// Dashboard Agent - Only 6 focused tools
const dashboardAgent = new ChatOpenAI({ model: "gpt-4" }).bindTools([
  createDashboard,
  getDashboard,
  updateDashboard,
  createTab,
  updateTab,
  deleteDashboard,
]);
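Under the hood, the routing layer is mostly plain data: each specialist owns a small tool set, and the router only picks a lane. Here’s a sketch — the keyword matcher is a toy stand-in for the LLM’s actual routing decision, and the tool names mirror the examples above:

```typescript
// Each specialist owns a small tool set; the router only picks a lane.
// The keyword matcher is a toy stand-in for the LLM's routing decision.
type Domain = "dashboard" | "widget" | "datasource" | "styling";

const registry: Record<Domain, { tools: string[] }> = {
  dashboard: {
    tools: ["createDashboard", "getDashboard", "updateDashboard",
            "createTab", "updateTab", "deleteDashboard"],
  },
  widget: { tools: ["createWidget", "updateWidget", "deleteWidget"] },
  datasource: { tools: ["listSurveys", "getQuestions"] },
  styling: { tools: ["updateTheme"] },
};

function route(request: string): Domain {
  const r = request.toLowerCase();
  if (/(chart|graph|widget|visuali)/.test(r)) return "widget";
  if (/(theme|style|dark mode|font)/.test(r)) return "styling";
  if (/(survey|question|dataset)/.test(r)) return "datasource";
  return "dashboard";
}

console.log(route("Style my dashboard in dark mode")); // "styling"
console.log(route("Add a gender chart"));              // "widget"
```

The point of the data shape: a request only ever sees one small tool set, so the “40 children” selection problem disappears by construction.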
Conversation Flow
User: “Create sales dashboard with gender chart from Customer Survey”
┌────────────────────────────────────────┐
│ 1. Federal Router                      │
│    Tools: 4 routing (400 tokens)       │
│    Decision: Dashboard → Data → Widget │
│    Cost: 1,400 tokens                  │
└────────────────────────────────────────┘
                    ↓
┌────────────────────────────────────────┐
│ 2. Dashboard Agent                     │
│    Tools: 6 dashboard (600 tokens)     │
│    Creates: "Sales" dashboard (ID 100) │
│    Cost: 1,400 tokens                  │
└────────────────────────────────────────┘
                    ↓
┌────────────────────────────────────────┐
│ 3. Datasource Agent                    │
│    Tools: 10 data (1,000 tokens)       │
│    Finds: Survey 12345, Question 67890 │
│    Cost: 2,200 tokens                  │
└────────────────────────────────────────┘
                    ↓
┌────────────────────────────────────────┐
│ 4. Widget Agent                        │
│    Tools: 8 widget (800 tokens)        │
│    Creates: Pie chart widget           │
│    Cost: 2,800 tokens                  │
└────────────────────────────────────────┘
Total workflow: ~8,200 tokens, including the router’s final synthesis pass (vs 60,000 for the monolith)
The Results (Holy Sh— I Mean, Wow!)
📊 The Numbers Don't Lie (But We Triple-Checked Anyway):
- Avg tokens per workflow: 8,200 (was 60,000) 📉
- Cost per workflow: $0.082 (was $0.60) 💸
- P95 latency: 3.5s (was 8s) 🚀
- Projected monthly costs: $780 (was $4,500) 🎯
- Success rate: 97% (was 89%) ✨
✅ 86% token reduction (not a typo!)
✅ 86% cost savings (CFO wants to buy us lunch)
✅ 56% faster responses (users think we upgraded servers)
✅ Clear debugging (we now know WHICH agent to blame)
✅ Team happiness: Through the roof
✅ Sleep quality: Significantly improved
At this point, we were pretty sure we’d reached peak performance. We were wrong.
Part 4: The Hybrid Approach - Maximum Efficiency
Peak Engineering
Or: What Happens When You Combine Your Two Good Ideas
After achieving what we thought was AI nirvana with multi-agents, I had a dangerous thought:
“Hey… what if we combine multi-agents WITH the skills pattern?”
It was the kind of idea that precedes either genius or disaster. (In software engineering, these are often the same thing.)
But then we realized: Why choose between our two good ideas when we can have BOTH?
Multi-agents solved the “too many tools” problem, but we could still use skills to optimize even further. It’s like discovering that peanut butter AND jelly make a better sandwich together. Revolutionary? No. Delicious? Absolutely.
Combining the Best of Both
Each specialized agent keeps its small core tool set and uses skills to pull deep context only when it actually needs it.
// Widget Agent with Skills
const widgetAgent = new ChatOpenAI({ model: "gpt-4" }).bindTools([
  // Core tools (lightweight, always bound)
  createWidget,
  updateWidget,
  deleteWidget,
  // Skill loader (heavy context on-demand)
  loadSkill,
]);
const widgetPrompt = `
You are the Widget Agent - visualization expert.
When you need details:
- Chart config → load_skill("bar_chart_config")
- Widget styling → load_skill("widget_styling", widgetId)
- Data mapping → load_skill("data_mapping_rules")
Current: Dashboard ${state.dashboardId}, Tab ${state.tabId}
`;

Hybrid in Action
User: “Create a pie chart showing age distribution”
Federal Router → Datasource Agent → Widget Agent
Datasource Agent:
├─ Loads: survey_schema skill (2,000 tokens)
├─ Finds age question
└─ Returns structured data
Widget Agent:
├─ Loads: pie_chart_config skill (1,500 tokens)
├─ Creates widget with proper mapping
└─ Total: 6,200 tokens ✅
Vs. loading everything upfront: 18,000 tokens ❌
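If you’re wondering where that 6,200 comes from, here’s a rough breakdown. The per-stage numbers are our own estimates, chosen to line up with the skill sizes and totals quoted above:

```typescript
// Where the ~6,200 tokens go in the flow above. Per-stage numbers are
// estimates, chosen to line up with the totals quoted in this post.
const stages = [
  { agent: "federal_router", base: 1_000, skills: 0 },
  { agent: "datasource",     base: 900,   skills: 2_000 }, // survey_schema
  { agent: "widget",         base: 800,   skills: 1_500 }, // pie_chart_config
];

const total = stages.reduce((sum, s) => sum + s.base + s.skills, 0);
console.log(total); // 6200 — vs ~18,000 with everything loaded upfront
```

More than half the budget is the two skill payloads, which is exactly where you want your tokens going: actual task context, not boilerplate.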
The Metrics
After implementing the hybrid approach and running it through our test suite (read: throwing everything at it to see what breaks), we got these numbers. We didn’t believe them at first either.
📊 The "Are We Sure This Is Right?" Results:
Token Usage (The Good Kind of Reduction):
├─ Simple queries: 2,500 (was 12,000) → 79% reduction 🎯
├─ Medium queries: 6,000 (was 35,000) → 83% reduction 💪
└─ Complex workflows: 12,000 (was 80,000) → 85% reduction 🚀
Cost Per User/Month:
├─ Light users: $0.25 (was $1.20) → 79% savings
├─ Regular users: $1.50 (was $8.75) → 83% savings
└─ Power users: $6.00 (was $35.00) → 83% savings
(Power users can stay, we can afford them now!)
Performance (Users Think We're Wizards):
├─ P50 latency: 1.8s (was 4.2s) → 57% faster ⚡
├─ P95 latency: 3.5s (was 8.7s) → 60% faster ⚡⚡
└─ P99 latency: 5.2s (was 15s) → 65% faster ⚡⚡⚡
Reliability (Actually Impressed Ourselves):
├─ Success rate: 97% (was 89%) → +8% improvement
├─ Tool selection: 99% (was 85%) → +14% improvement
└─ Context overflow: 0.1% (was 8%) → Basically extinct
💰 Projected Monthly Costs: $780 (was $4,500)
💰 Annual Savings: $44,640
💰 Engineer Stress Levels: Down 90%
☕ Coffee Consumption: Actually decreased
😴 Sleep Quality: Markedly improved
Choosing Your Pattern: The “Which One Do I Actually Need?” Decision Tree
Or: Saving You From Our Mistakes
Look, we tried ALL the things so you don’t have to. Here’s our hard-won wisdom, gained through tears, tokens, and too much coffee:
Final Architecture Overview
┌─────────────────────────────────────────────┐
│ Federal Router Agent (GPT-3.5) │
│ 4 routing tools (400 tokens) │
│ Fast, cheap classification │
└─────────────────────────────────────────────┘
│
┌───────────┼───────────┬──────────┐
↓ ↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Dashboard │ │ Widget │ │Datasource│ │ Styling │
│ (GPT-4) │ │ (GPT-4) │ │ (GPT-4) │ │(GPT-3.5) │
├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤
│6 tools │ │8 tools │ │10 tools │ │6 tools │
│+ Skills: │ │+ Skills: │ │+ Skills: │ │+ Skills: │
│ filters │ │ configs │ │ schemas │ │ themes │
│ layouts │ │ mapping │ │ datasets │ │ fonts │
│ rules │ │ styling │ │ stacks │ │ a11y │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
Lessons From the Trenches
What Actually Worked (Surprisingly)
1. Baby Steps Don’t Just Work, They’re Essential
We wanted to rebuild everything overnight. Our tech lead said “no.” We’re glad he did.
┌─────────────────────┬──────────────┬────────────────┐
│ Your Situation │ Pattern │ Expected Gain │
├─────────────────────┼──────────────┼────────────────┤
│ < 10 tools │ Basic Agent │ Keep simple │
│ Simple context │ │ │
├─────────────────────┼──────────────┼────────────────┤
│ 10-25 tools │ Skills │ 50-60% savings │
│ Large context │ │ │
│ Single domain │ │ │
├─────────────────────┼──────────────┼────────────────┤
│ 25-50 tools │ Multi-Agent │ 70-80% savings │
│ Multiple domains │ │ │
│ Clear separation │ │ │
├─────────────────────┼──────────────┼────────────────┤
│ 50+ tools │ Hybrid │ 80-85% savings │
│ Complex domains │ │ │
│ Deep context needs │ │ │
└─────────────────────┴──────────────┴────────────────┘
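The decision table collapses into a small function. The thresholds are our rules of thumb, not hard limits — profile your own workload before committing:

```typescript
// The decision table as code. Thresholds are rules of thumb,
// not hard limits.
type Pattern = "basic" | "skills" | "multi-agent" | "hybrid";

function choosePattern(opts: {
  toolCount: number;
  domains: number;      // distinct problem domains the agent covers
  deepContext: boolean; // needs large schemas/configs on demand
}): Pattern {
  const { toolCount, domains, deepContext } = opts;
  if (toolCount >= 50 || (domains > 1 && deepContext)) return "hybrid";
  if (toolCount >= 25 || domains > 1) return "multi-agent";
  if (toolCount >= 10 || deepContext) return "skills";
  return "basic";
}

console.log(choosePattern({ toolCount: 6, domains: 1, deepContext: false })); // "basic"
console.log(choosePattern({ toolCount: 60, domains: 4, deepContext: true })); // "hybrid"
```

Note the ordering: check for the heaviest pattern first, so a 60-tool multi-domain system doesn’t get talked into a basic agent by a single lucky condition.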

Hard Truths We Learned (The Expensive Way)
1. Tool Descriptions: Write Them Like Your Job Depends On It
Because your token budget literally does.
❌ Bad (Our First Attempt):
createWidget: {
  description: "Creates a widget",
}
// LLM: "Cool story bro, but WHEN do I use this?"
✅ Good (After Many Painful Iterations):
createWidget: {
  description: `Creates a visualization widget.
Use when: User asks to "add chart/graph/visualization"
Don't use for: Updating widgets (use update_widget)`,
}
// Yes, we're treating the LLM like a junior dev.
// Yes, it works.
Pro tip: Think of tool descriptions as writing documentation for the world’s most literal intern. Because that’s essentially what it is.
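To keep ourselves honest about this, descriptions can be linted before they ever reach the model. The checks below are our own convention, not a LangChain feature:

```typescript
// Tiny lint pass for tool descriptions: catch "Creates a widget"-style
// one-liners before the LLM has to guess. Our convention, not LangChain's.
function lintToolDescription(name: string, description: string): string[] {
  const problems: string[] = [];
  if (description.length < 40) {
    problems.push(`${name}: too short to disambiguate among 40+ tools`);
  }
  if (!/use when/i.test(description)) {
    problems.push(`${name}: missing "Use when:" guidance`);
  }
  if (!/don't use|do not use/i.test(description)) {
    problems.push(`${name}: missing negative guidance`);
  }
  return problems;
}

console.log(lintToolDescription("createWidget", "Creates a widget"));
// flags all three problems, as expected
```

Run it in CI and the world’s most literal intern gets documentation that passes review before it ships.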
2. Monitor What Matters
Track more than just cost:
logger.info({
  event: "agent_execution",
  workflowName: "dashboard_creation",
  metadata: {
    agentType: "widget_agent",
    skillsLoaded: ["bar_chart_config"],
    tokensUsed: 3800,
  },
});
3. Progressive Skill Rollout
Don’t implement all skills at once. Start with the most-used:
Priority 1: survey_schema (80% of requests)
Priority 2: bar_chart_config (60% of requests)
Priority 3: widget_styling (40% of requests)
Priority 4: dashboard_filters (30% of requests)
4. Don’t Over-Specialize (Learn From Our Hubris)
In our enthusiasm, we initially created 8 agents. EIGHT. We were basically creating an AI bureaucracy.
Original (Too Many):
├─ Dashboard Agent
├─ Tab Agent ❌ (seriously, tabs needed their own agent?)
├─ Widget Agent
├─ Widget Styling Agent ❌ (why did we do this to ourselves)
├─ Survey Agent ❌
├─ Dataset Agent ❌
├─ Analytics Agent ❌
└─ Export Agent ❌
After Therapy (Better):
├─ ✅ Dashboard Agent (includes tabs, we're not savages)
├─ ✅ Widget Agent (includes styling, it's fine)
├─ ✅ Datasource Agent (all data sources, one happy family)
└─ ✅ Styling Agent (themes + accessibility)
Sweet spot: 4-6 agents
More than that and you’re just creating coordination overhead. It’s like having too many group chats - eventually nobody knows what’s happening where.
Quick Wins That’ll Make You Look Like a Hero
(Seriously, Do These Today)
1. Stop Writing Novellas in Your System Prompts (30 minutes)
Your system prompt is not the place to explain your life story, company history, and philosophical stance on data visualization.
2. Return Structured Data from Skills
❌ Bad (text):
"The survey has gender as multiple choice and age as numeric..."
✅ Good (JSON):
{
  "questions": [
    { "id": 67890, "text": "Gender?", "type": "multiple_choice" },
    { "id": 67891, "text": "Age?", "type": "numeric" }
  ]
}
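Taking that one step further, typed and validated payloads mean malformed skill data fails loudly instead of silently confusing the next agent. The field names mirror the example above; the validator itself is an illustrative sketch, not our production code:

```typescript
// Typed, validated skill payloads: bad data fails loudly instead of
// silently confusing the next agent. Illustrative sketch.
interface SurveyQuestion {
  id: number;
  text: string;
  type: "multiple_choice" | "numeric" | "text";
}

interface SurveySchema {
  questions: SurveyQuestion[];
}

function parseSurveySchema(raw: unknown): SurveySchema {
  const obj = raw as SurveySchema;
  if (!Array.isArray(obj?.questions)) throw new Error("missing questions[]");
  for (const q of obj.questions) {
    if (typeof q.id !== "number" || typeof q.text !== "string") {
      throw new Error(`malformed question: ${JSON.stringify(q)}`);
    }
  }
  return obj;
}

const schema = parseSurveySchema({
  questions: [
    { id: 67890, text: "Gender?", type: "multiple_choice" },
    { id: 67891, text: "Age?", type: "numeric" },
  ],
});
console.log(schema.questions.length); // 2
```

A schema library like Zod does this more thoroughly; the hand-rolled version is just here to show the shape of the idea.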
What We’ve Learned (So Far)
The Real Truth About Building AI Agents
Here’s what they don’t tell you in the glossy blog posts and conference talks:
Building scalable AI agents isn’t about having the most sophisticated architecture from day one. It’s about:
- Start simple - Basic agent for MVP (it’s okay, we all started here)
- Measure everything - If you’re not tracking tokens, you’re flying blind
- Fail fast, learn faster - We made EVERY mistake so you don’t have to
- Evolve incrementally - Skills → Multi-Agent → Hybrid (this is the way)
- Focus on user value - Fast, accurate, reliable (novel concept, we know)
Our Journey in Numbers (The Before/After You’ve Been Waiting For)
Where We Started (The Dark Times):
├─ Cost: $4,500/month (gulp)
├─ Tokens: 15k per request (yikes)
├─ Success: 89% (not great, Bob)
└─ Team morale: Low
Where We're At Now (The Good Times):
├─ Cost: $780/month (manageable!)
├─ Tokens: 6.6k per request (sustainable)
├─ Success: 97% (actually good!)
└─ Team morale: High
What’s Next (Or: What We’re Learning Next)
As we continue building and scaling QuestionPro’s BI agent, here’s what we’re watching:
- Context management is king - Progressive disclosure isn’t optional, it’s survival
- Specialization is inevitable - One agent doing everything is like one person doing all jobs at a company (recipe for disaster)
- Hybrid is the sweet spot - Why choose between good ideas?
- Cost optimization is non-negotiable - It’s literally the difference between “cool AI feature” and “bankrupt company”
- Testing is hard - LLMs are non-deterministic. Our test suite has trust issues.
- Documentation matters - Future you will thank present you (we learned this the hard way)
Resources (That Actually Helped Us)
LangGraph Documentation:
- Multi-Agent Patterns (bookmark this, seriously)
- Progressive Disclosure / Skills Pattern (game changer)
- Examples that actually work (rare, treasure them)
Monitoring Tools We Actually Use:
- Langfuse (our current favorite)
- LangSmith (also good)
- Custom OpenTelemetry (for the brave)
- Coffee (monitors our alertness)
Calculate Your Potential Savings (Do This Right Now)
Seriously, take 2 minutes and see what you could be saving:
// Your current situation (plug in your own numbers)
const monthlyRequests = 100_000;  // example volume
const avgTokens = 15_000;         // tokens per request
const costPerMillionTokens = 10;  // ~$10/1M, adjust for your model
const currentCost = (monthlyRequests * avgTokens / 1_000_000) * costPerMillionTokens;

// What you COULD be saving
const potentialSavings = {
  withSkills: currentCost * 0.5,     // ~50% reduction
  withMultiAgent: currentCost * 0.7, // ~70% reduction
  withHybrid: currentCost * 0.85,    // ~85% reduction
};

// Now imagine what you could do with that money
// (We're thinking coffee budget + GPU upgrades)
Final Thoughts (The Honest Ones)
Look, we’re still figuring this out. We’re still learning. We’re still making mistakes (just more expensive ones now that we know better). But that’s the point… nobody has this completely figured out. AI is moving faster than documentation can keep up.
What we DO know:
- Start simple, evolve based on actual problems
- Measure EVERYTHING (seriously, token tracking is non-negotiable)
- Don’t over-engineer (we did this so you don’t have to)
- Community knowledge is invaluable (thank you, internet strangers)
- It’s okay to not know everything (we certainly don’t)
