I Spent a Weekend Building AI Agents. Most of Them Were Useless.

Tags: ai, agents, crewai, langgraph, python

After reading about AI agents for months — Anthropic’s research, the CrewAI hype, the LangGraph discourse on Twitter — I blocked off a weekend to actually build some. Three experiments, two frameworks, one honest assessment.

Experiment 1: Content Pipeline (CrewAI)

The idea: a team of agents that researches a topic, writes an article, edits it, and fact-checks the result. Four agents with distinct roles.

from crewai import Agent

# web_search and arxiv_search are custom tools defined elsewhere in the project.
# backstory is required by CrewAI's Agent; placeholder values shown here.
researcher = Agent(
    role="Research Analyst",
    goal="Find accurate, current information on the topic",
    backstory="A careful analyst who prefers primary sources",
    tools=[web_search, arxiv_search],
    llm="claude-sonnet-4-5-20250514"
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear, engaging content from research findings",
    backstory="A writer who turns research notes into readable prose",
    llm="claude-sonnet-4-5-20250514"
)
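
To run the pipeline, the roles get wrapped in tasks and handed to a Crew object. A rough sketch of the assembly (only the first two tasks shown; the editor and fact-checker agents and their tasks follow the same pattern, and the descriptions here are illustrative):

from crewai import Crew, Process, Task

research_task = Task(
    description="Research the topic and collect current, credible sources",
    expected_output="Bullet-point findings with links",
    agent=researcher,
)
write_task = Task(
    description="Draft an article from the research findings",
    expected_output="A readable first draft",
    agent=writer,
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,  # tasks run in order, each seeing the prior output
)
result = crew.kickoff()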

Result: Actually decent. The researcher agent found relevant papers and blog posts. The writer produced a reasonable draft. The editor caught some awkward phrasing and structural issues. The fact-checker flagged two claims that were outdated.

But: The whole pipeline took 4 minutes and cost about $0.35 for a single article. A single well-crafted prompt to Claude takes 30 seconds and costs $0.03. The multi-agent version was better maybe 70% of the time. For the other 30%, the single prompt was just as good.

The content pipeline is the best use case I’ve found for multi-agent systems. The role separation genuinely helps because research, writing, and editing are fundamentally different cognitive tasks.

Experiment 2: Code Review Pipeline (LangGraph)

I tried building a more sophisticated code review system using LangGraph’s graph-based approach: static analysis agent → security reviewer → performance auditor → summary generator.

from langgraph.graph import StateGraph, START, END

workflow = StateGraph(ReviewState)
workflow.add_node("static_analysis", static_analysis_agent)
workflow.add_node("security_review", security_review_agent)
workflow.add_node("performance_audit", performance_audit_agent)
workflow.add_node("summarize", summary_agent)

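The edge wiring was a straight line plus a compile step. Reconstructed here as a sketch; the "diff" state field is an assumed part of ReviewState:

workflow.add_edge(START, "static_analysis")
workflow.add_edge("static_analysis", "security_review")
workflow.add_edge("security_review", "performance_audit")
workflow.add_edge("performance_audit", "summarize")
workflow.add_edge("summarize", END)

graph = workflow.compile()
review = graph.invoke({"diff": code_under_review})  # initial state for the run
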
Result: Worse than my single-agent code review bot. Each specialized agent was less capable than one general-purpose agent because they had narrower context. The security reviewer kept flagging things that the static analyzer already caught. The summarizer sometimes lost important details from earlier agents.

Lesson: Multi-agent doesn’t automatically mean better. If a single agent with good instructions can handle the task, adding more agents just adds latency, cost, and potential for information loss between handoffs.

Experiment 3: Research Assistant (CrewAI + Tools)

A research assistant that searches the web, reads papers, synthesizes findings, and generates a briefing document. This is where tools make agents interesting — they can actually take actions, not just generate text.
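
The shape of it, roughly: custom tools are plain decorated functions that the agent can call as many times as it needs. A minimal sketch, assuming a recent CrewAI version where the @tool decorator lives in crewai.tools (the tool body is a stand-in for the real arXiv query):

from crewai import Agent
from crewai.tools import tool

@tool("arXiv search")
def arxiv_paper_search(query: str) -> str:
    """Search arXiv and return titles and abstracts for the top matches."""
    # Stand-in body; the real tool calls the arXiv API and formats the results.
    return f"Top arXiv results for '{query}': ..."

assistant = Agent(
    role="Research Assistant",
    goal="Find, read, and synthesize papers into a briefing document",
    backstory="A researcher who iterates: search, skim, refine the query, repeat",
    tools=[arxiv_paper_search],
    llm="claude-sonnet-4-5-20250514",
)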

Result: This was the one that surprised me. The agent found papers I wouldn’t have found myself, made connections between them that I hadn’t considered, and produced a research briefing that genuinely saved me time. The key difference: this task is naturally multi-step and benefits from iterative tool use.

What I Actually Learned

When Multi-Agent Works

  • Tasks where role separation maps to genuinely different skills (research vs. writing vs. editing)
  • Workflows that need iterative tool use over multiple steps
  • Situations where having a “critic” agent improves output quality

When It Doesn’t

  • Tasks that a single well-prompted agent can handle
  • When agents need heavy context-sharing (information gets lost in translation)
  • When latency matters — every agent adds a full LLM round-trip
  • When cost matters — 4 agents × 3 calls each = 12 API calls per task

Framework Comparison

CrewAI is easier to get started with. Define roles, define tasks, run the crew. If your workflow is sequential (A → B → C), CrewAI gets out of your way.

LangGraph gives you more control. Conditional edges, loops, state management. If your workflow needs branching (“if the security review finds critical issues, stop and alert; otherwise continue”), LangGraph handles it better.
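
For the code-review example, that branch would look something like this (it replaces the plain security_review → performance_audit edge; the alert node and the critical_issues state field are illustrative):

def route_after_security(state: ReviewState) -> str:
    # Branch on whatever the security node wrote into the shared state
    return "alert" if state.get("critical_issues") else "performance_audit"

workflow.add_node("alert", alert_agent)  # hypothetical notification node
workflow.add_edge("alert", END)
workflow.add_conditional_edges(
    "security_review",
    route_after_security,
    {"alert": "alert", "performance_audit": "performance_audit"},
)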

I use CrewAI for content workflows and LangGraph when I need more complex orchestration. They’re not competing — they solve different problems.

The Economics

A 4-agent pipeline costs roughly 4-8x more than a single-agent call for the same task. Sometimes that’s worth it. For a content pipeline that runs twice a week, the quality improvement justifies $0.35 per article. For a code review that runs 20 times a day, it doesn’t.

My Honest Take on the Agent Hype

AI agents are powerful and the technology is real. But the discourse makes it sound like everything should be multi-agent. It shouldn’t. Most tasks that people build agent systems for could be handled by a single LLM call with better prompting.

Start with one agent. If it’s not good enough, figure out why. Is it a reasoning problem? Add chain-of-thought. Is it a context problem? Add retrieval. Is it genuinely multiple different skills? Then maybe add agents.

The boring answer is almost always the right one: write a better prompt first, add agents later.