What are AI agents?
AI agents are large language model (LLM) systems that pursue goals through a reasoning loop: they observe a situation, plan steps, call tools, observe the results, and adjust their next action accordingly. Where a chatbot responds to a prompt and stops, an agent keeps going until the task is done or it determines that it cannot complete the task. As of May 2026, AI agents have moved from research curiosity to operational reality. The Brookings Institution describes the shift as one of the most consequential developments in applied AI, with both the upside (autonomous completion of substantive knowledge work) and the downside (a new class of reliability and safety problems) actively unfolding.
This is the first post in our AI Agents series. It covers what an agent actually is, the core capabilities that distinguish an agent from a chatbot, the major framework ecosystem, where agents earn their place in business operations, and the honest limitations behind Gartner’s forecast that more than 40% of agentic AI projects will be canceled by the end of 2027. Subsequent posts in the series cover frameworks, building patterns, evaluation, failure modes, security, and vertical use cases. For broader AI context, see our pieces on what artificial intelligence is and how machine learning works; for the models behind much of the current agent capability, see our GPT-5.5 piece on the foundation models that power agents.
What an AI agent actually is
An AI agent is software that combines an LLM (the reasoning engine), tools the LLM can invoke (web search, code execution, database queries, API calls), memory (working context plus optional persistent storage), and a control loop that lets the LLM act over multiple steps to complete a task. The agent observes its environment, decides what to do next, takes an action, observes the result, and continues until the task is done.
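The control loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any framework's API: `Step`, `scripted_model`, and `run_agent` are hypothetical names, and the "model" is a hard-coded stand-in where a real agent would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One decision from the model: call a tool, or finish with an answer."""
    tool: Optional[str] = None   # tool name, or None when the task is done
    argument: str = ""           # input passed to the tool
    answer: str = ""             # final answer, used only when tool is None


def scripted_model(goal: str, history: list[str]) -> Step:
    """Hypothetical stand-in for an LLM call: search once, then finish."""
    if not history:
        return Step(tool="search", argument=goal)
    return Step(answer=f"Done after {len(history)} observation(s): {history[-1]}")


def run_agent(
    goal: str,
    model: Callable[[str, list[str]], Step],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    history: list[str] = []                      # working memory for this run
    for _ in range(max_steps):
        step = model(goal, history)              # decide what to do next
        if step.tool is None:
            return step.answer                   # the model says the task is done
        result = tools[step.tool](step.argument) # take the action
        history.append(f"{step.tool} -> {result}")  # observe the result
    return "stopped: step budget exhausted"      # give up instead of looping forever


tools = {"search": lambda query: f"3 documents found for '{query}'"}
result = run_agent("Microsoft Q1 2026 earnings", scripted_model, tools)
print(result)
```

The `max_steps` budget is a design choice worth keeping even in a toy: it is the simplest guard against an agent that never decides it is finished.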
A concrete contrast. Asking ChatGPT "what was Microsoft’s revenue in Q1 2026" is a conversation with a chatbot. The model produces an answer (or a refusal, if the question is outside its training cutoff) and stops. Asking an agent "produce a summary of Microsoft’s Q1 2026 earnings, with revenue, segment breakdown, and the three most-discussed topics from the earnings call" is a different request. The agent will search for the earnings release, retrieve the document, extract the numbers, search for analyst commentary, synthesize the topics discussed, structure the output, and deliver a finished summary. The work involves multiple tool calls, multiple intermediate outputs, and a loop that keeps going until the agent has what it set out to produce.
The category of work an agent handles well shares specific characteristics:
- Goal-directed rather than prompt-directed: the user describes an outcome, not a sequence of steps.
- Tool-dependent: the work requires accessing external data, running code, or interacting with systems beyond the LLM’s training knowledge.
- Multi-step: the task naturally decomposes into several sub-tasks that the agent plans and executes.
- Bounded but non-trivial: the agent can recognize completion criteria, but the path to those criteria is not predetermined.
The work an agent handles poorly: tasks that require real-world physical action without robotics, tasks requiring judgment that depends on context the LLM does not have, and tasks where an error would be catastrophic and the output cannot be easily verified.
The core capabilities that make an agent
Four capabilities, in roughly the order they were developed:
- Planning: the agent decomposes a goal into sub-tasks and orders them. Planning can be explicit (the agent generates a written plan before acting) or implicit (the model reasons step-by-step through the task as it goes). Modern frontier models handle planning natively; the explicit-plan pattern remains useful for complex tasks where the agent benefits from articulating the structure before executing.
- Tool use: the agent calls functions or APIs to interact with systems beyond the LLM itself. Web search, code execution, database queries, file system access, third-party APIs (Slack, Salesforce, GitHub, etc.) are common tools. The OpenAI function-calling pattern (introduced 2023) and Anthropic’s Model Context Protocol (MCP, 2024) are the major standards for agent-tool integration.
- Memory: agents need to track state across steps. Three categories of memory matter: working memory (the current conversation context, which fits in the model’s context window), episodic memory (recent interactions stored in a vector database for retrieval), and persistent memory (long-term storage of facts, preferences, learned patterns).
- Autonomy and reflection: how much human oversight the agent operates under. A fully supervised agent runs every action by a human first; a fully autonomous agent runs without human input within defined boundaries. Most production agents sit between these poles. Reflection (the agent evaluating its own progress and self-correcting) is a related capability that improves reliability over long-horizon tasks.
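Tool use and bounded autonomy can be combined in one small dispatcher. The sketch below is an assumption-laden illustration: the tool names (`kb_lookup`, `issue_refund`), the registry layout, and the `approve` callback are all hypothetical, and it does not follow the OpenAI function-calling or MCP wire formats.

```python
from typing import Callable

# Hypothetical tool registry: name -> (function, requires_human_approval)
TOOLS: dict[str, tuple[Callable[[str], str], bool]] = {
    "kb_lookup": (lambda query: f"article about {query}", False),
    "issue_refund": (lambda order_id: f"refund issued for {order_id}", True),
}


def call_tool(name: str, argument: str, approve: Callable[[str, str], bool]) -> str:
    """Run a tool, asking a human first when the tool is marked as sensitive."""
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    func, needs_approval = TOOLS[name]
    if needs_approval and not approve(name, argument):
        return f"blocked: '{name}' was not approved"
    return func(argument)


def deny_all(name: str, argument: str) -> bool:
    """Stand-in for a human reviewer who rejects every request."""
    return False


def allow_all(name: str, argument: str) -> bool:
    """Stand-in for a human reviewer who approves every request."""
    return True


print(call_tool("kb_lookup", "password reset", deny_all))
print(call_tool("issue_refund", "A-100", deny_all))
print(call_tool("issue_refund", "A-100", allow_all))
```

Marking approval per tool rather than per agent is one way to sit between the two poles: routine lookups run unattended while consequential actions wait for a person.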
These four capabilities combine in different proportions across different agent types. A coding agent (Cursor, Cline) leans heavily on planning and tool use (file edits, terminal commands) while operating with minimal autonomy (a human approves each step). A customer-service triage agent leans on autonomy (handling routine cases without escalation) and tool use (knowledge base lookup, ticket creation), with bounded memory (within-conversation context).
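The three memory categories from the list above can be pictured as three containers with different lifetimes. This is a rough sketch with hypothetical names (`AgentMemory`, `remember`, `recall`); in particular, the keyword-overlap search is a deliberate simplification standing in for the vector-database retrieval that real episodic memory would use.

```python
from collections import deque


class AgentMemory:
    def __init__(self, working_size: int = 3):
        # Working memory: only the most recent items, like a bounded context window.
        self.working: deque[str] = deque(maxlen=working_size)
        # Episodic memory: every interaction, kept for later retrieval.
        self.episodic: list[str] = []
        # Persistent memory: long-lived facts and preferences.
        self.persistent: dict[str, str] = {}

    def remember(self, text: str) -> None:
        self.working.append(text)   # oldest item drops out when full
        self.episodic.append(text)

    def recall(self, query: str, k: int = 1) -> list[str]:
        """Keyword-overlap retrieval; a crude stand-in for vector similarity."""
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self.episodic]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [text for score, text in ranked if score > 0][:k]


memory = AgentMemory(working_size=2)
for note in ["user asked about invoice 42",
             "looked up shipping policy",
             "drafted reply to user"]:
    memory.remember(note)
memory.persistent["preferred_language"] = "en"

print(list(memory.working))         # the first note has fallen out of working memory
print(memory.recall("invoice 42"))  # but it can still be retrieved from episodic memory
```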
How AI agents differ from chatbots and assistants
The terms get blurred in marketing materials, but the distinctions matter:
- Chatbot: responds to a single prompt with a single response. May handle multi-turn conversation but each turn is a complete prompt-response cycle. Use cases: customer FAQ, conversational interfaces, simple Q&A.
- Assistant: an enhanced chatbot with some built-in capabilities (search, code execution, image generation) that the user can invoke through conversation. The assistant invokes capabilities when the user’s prompt triggers them. Use cases: ChatGPT consumer experience, Microsoft Copilot in Office, Google Gemini in Workspace. The user is in the loop for every action.
- Agent: completes a task autonomously across multiple steps, deciding which tools to use and when, with bounded human oversight. The user describes the goal; the agent decides the path. Use cases: coding agents that complete pull requests, customer-service agents that resolve full tickets, research agents that produce finished reports.
The line between an "assistant with tools" and an "agent" is fuzzy in practice. The distinguishing characteristic is whether the system takes multiple actions toward a goal without prompting between each one. If yes, it is an agent. If the human approves each step, it is closer to an assistant.
The AI agent framework landscape in 2026
Several open-source frameworks dominate the developer landscape:
- LangGraph: built by the LangChain team, focused on explicit state and control flow. Surpassed CrewAI in GitHub stars in early 2026, driven by enterprise adoption. Graph-based architecture maps cleanly to production requirements like audit trails, rollback points, and conditional routing. Positioned to own the production / enterprise tier.
- CrewAI: role-based multi-agent framework. Optimized for quickly assembling crews of agents with defined roles and process types. Teams often start with CrewAI for prototyping and migrate to LangGraph when they need production-grade state management.
- AutoGen: Microsoft’s framework, focused on conversational agent interaction. Implements conversational agent teams where agents interact through multi-turn conversations with a selector determining who speaks next.
- Semantic Kernel: Microsoft’s broader orchestration framework with agent support. Common choice in Microsoft-centric enterprise environments.
- Vendor SDKs: OpenAI’s Agents SDK, Anthropic’s Claude Agent SDK, and similar offerings from each major lab provide vendor-specific agent tooling that often integrates cleanly with the lab’s frontier model.
Beyond frameworks, the major foundation-model labs and third-party vendors ship purpose-built agents:
- Coding agents: OpenAI Codex (powered by GPT-5.5; see our Daybreak coverage for the cybersecurity-specific application), Anthropic Claude Code, Cursor (third-party IDE-integrated agent), Cline (open-source VS Code extension), GitHub Copilot (agent mode, 2025).
- Browser agents: OpenAI Operator, Anthropic Computer Use, and various third-party browser-control agents.
- Specialized vertical agents: cybersecurity agents (Daybreak, Project Glasswing), research agents, customer-service agents from Intercom and Zendesk.
A common production pattern is using CrewAI or similar role-based frameworks for the research and synthesis phase, then passing a structured object to LangGraph for the execution phase with deterministic state management and human-in-the-loop checkpoints. The "one framework for everything" approach has largely given way to "the right framework for each phase."
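The handoff pattern above can be illustrated without either framework. The sketch below uses plain Python and makes no claim about CrewAI's or LangGraph's actual APIs: `ResearchHandoff`, `ExecutionState`, and `execute` are hypothetical names, and the field layout is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class ResearchHandoff:
    """Structured object produced by the research phase (illustrative fields)."""
    topic: str
    findings: tuple[str, ...]
    proposed_action: str


@dataclass
class ExecutionState:
    """Explicit state for the execution phase, with a log usable as an audit trail."""
    status: str = "pending"
    log: list[str] = field(default_factory=list)


def execute(handoff: ResearchHandoff,
            approve: Callable[[ResearchHandoff], bool]) -> ExecutionState:
    """Deterministic sequence: validate, human checkpoint, then act."""
    state = ExecutionState()
    if not handoff.findings:
        state.status = "rejected"
        state.log.append("validate: no findings supplied")
        return state
    state.log.append(f"validate: {len(handoff.findings)} finding(s)")
    if not approve(handoff):                      # human-in-the-loop checkpoint
        state.status = "awaiting_human"
        state.log.append("checkpoint: approval withheld")
        return state
    state.log.append("checkpoint: approved")
    state.log.append(f"act: {handoff.proposed_action}")
    state.status = "done"
    return state


handoff = ResearchHandoff(
    topic="vendor comparison",
    findings=("Vendor A is cheaper", "Vendor B has better support"),
    proposed_action="draft recommendation memo",
)
approved = execute(handoff, approve=lambda h: True)
held = execute(handoff, approve=lambda h: False)
print(approved.status, approved.log)
print(held.status, held.log)
```

The point of the frozen dataclass is that the execution phase receives a fixed, inspectable object rather than free-form model output, which is what makes the second phase deterministic.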
Where AI agents earn their place
Four categories of work have produced reliable agent deployments by 2026:
- Software development: the highest-signal area. Coding agents complete pull requests, generate tests, refactor code, navigate large codebases. SWE-bench Verified scores in the 70%+ range for frontier models on real GitHub issue resolution show the capability is real. Production teams use Cursor, Cline, Claude Code, and Codex in daily workflows.
- Customer service triage: agents handle first-line questions, route complex issues, and resolve routine tickets. Quality varies by deployment, but well-built agents now handle 30–60% of inbound support volume in many B2C contexts without human escalation.
- Research and data analysis: agents conduct multi-source research, synthesize findings, and produce structured outputs. The pattern works well for narrow questions with abundant source material; it works less well for questions requiring expert judgment.
- Operations automation: agents perform routine IT operations (provisioning, incident triage, log analysis), data pipeline maintenance, and other operational work that previously required scripted automation plus human oversight. DevOps agents are the active frontier.
The pattern across these successful deployments: bounded scope, abundant feedback signals, recoverable errors, and human verification for high-stakes outputs. Agents fail in production where these conditions don’t hold.
The honest limitations
Three reality checks are worth surfacing:
The failure rate is non-trivial. Gartner forecasts that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The hype-to-production translation has been harder than the marketing implied.
Failure modes are subtle. Agent runs can span hundreds of steps, each depending on prior tool calls and memory, so failures rarely surface at the moment they occur. A bad assumption at step 3 can quietly contaminate step 50, with the agent remaining wrong for extended periods without noticing. Common failure modes documented in the Failure Modes in Agentic AI workshop at ICML 2026 include context drift, knowledge attrition, infinite loops, circular reassurance patterns, and latency abuse. Detecting these failures is often the harder part of the problem.
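Of the failure modes listed, infinite loops are the easiest to check for mechanically. The sketch below assumes a run trace recorded as (tool, argument) pairs, which is a hypothetical format, and flags any call repeated at least `threshold` times; it will miss loops where the arguments vary slightly between iterations.

```python
from collections import Counter


def find_repeated_calls(trace: list[tuple[str, str]],
                        threshold: int = 3) -> list[tuple[str, str]]:
    """Return (tool, argument) pairs that occur at least `threshold` times."""
    counts = Counter(trace)
    return [call for call, n in counts.items() if n >= threshold]


# Hypothetical trace of one agent run
trace = [
    ("search", "refund policy"),
    ("open", "doc-1"),
    ("search", "refund policy"),
    ("search", "refund policy"),
]
repeated = find_repeated_calls(trace)
print(repeated)
```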
Evaluation is genuinely difficult. Most established benchmarks (SWE-bench, GAIA, WebArena, AgentBench) assess tasks that take minutes to hours, while deployed agents may sustain tasks for hours or days, and week-long autonomous operation is projected. Systematic methods for evaluating long-term reliability remain largely unexplored. Vendor-reported benchmark numbers often do not translate to production performance on your specific workload.
The realistic posture for businesses adopting agents in 2026 is "production-pilot, not production-deployment-at-scale." The technology is real and useful in specific applications; the operational discipline to deploy it safely at scale is still maturing.
What this means for your business
For business operators evaluating AI agents in 2026, three practical guidelines:
- Start narrow: pick a specific, bounded use case where errors are recoverable and humans can verify outputs at reasonable cost. “Generate the first draft of our weekly metrics report” is bounded. “Replace our support team with agents” is not.
- Invest in evaluation before scale: build the evaluation harness alongside the agent, not afterward. The pattern that works is “small agent, comprehensive evals, learn what fails, iterate.” The pattern that fails is “ship the agent, hope it works.”
- Plan for the human-in-the-loop reality: fully autonomous agents are not the typical 2026 production pattern. Plan for human verification of high-value outputs, escalation paths for failure cases, and feedback loops that improve the agent over time.
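The "small agent, comprehensive evals" pattern from the guidelines above can start as something this small. Everything here is hypothetical: `toy_agent` is a deliberately buggy stand-in for a real agent, and the cases are arithmetic only so the example stays self-contained.

```python
from typing import Callable

# Hypothetical eval cases: (task, check applied to the agent's output)
CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("2 + 2", lambda out: out == "4"),
    ("10 / 4", lambda out: out == "2.5"),
    ("7 * 6", lambda out: out == "42"),
]


def toy_agent(task: str) -> str:
    """Stand-in for a real agent; deliberately wrong on division."""
    left, op, right = task.split()
    a, b = int(left), int(right)
    if op == "+":
        return str(a + b)
    if op == "*":
        return str(a * b)
    return str(a // b)  # bug: integer division instead of true division


def run_evals(agent: Callable[[str], str],
              cases: list[tuple[str, Callable[[str], bool]]]) -> dict:
    """Run every case, treating exceptions as failures, and report what failed."""
    failures = []
    for task, check in cases:
        try:
            ok = check(agent(task))
        except Exception:
            ok = False
        if not ok:
            failures.append(task)
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


report = run_evals(toy_agent, CASES)
print(report)
```

Returning the list of failing tasks, not just a score, is what supports the "learn what fails, iterate" step.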
The next post in this series (AI Agent Frameworks Compared) covers the decision tree for choosing between LangGraph, CrewAI, AutoGen, and the vendor SDKs. Subsequent posts cover building your first agent, the failure modes worth knowing, evaluation patterns, and specific vertical use cases.
Frequently Asked Questions
What’s the difference between an AI agent and ChatGPT?
ChatGPT is a chat product built on OpenAI’s models. Its default mode is conversational Q&A: you ask, it answers, you ask the next thing. ChatGPT has gained agent-like capabilities (web search, code execution, file analysis), making it closer to an assistant than a pure chatbot. A standalone AI agent goes further: it pursues a goal autonomously, takes multiple actions, and runs until the task completes. ChatGPT can host agentic workflows through specific features (custom GPTs, Operator); a purpose-built agent is typically more capable in its domain because it is designed around the specific task.
Are AI agents safe to deploy in production?
“Safe” depends on the deployment context. Agents in low-stakes contexts with human verification are deployed safely in many businesses today. Agents in high-stakes contexts (financial transactions, medical decisions, security-critical actions) require careful design, evaluation, and oversight. The pattern that has held across production deployments: bounded autonomy, comprehensive evaluation, human-in-the-loop for high-value actions, and detection systems for failure modes. The technology can be deployed safely; the operational discipline to do so requires investment that some businesses underestimate.
Which AI agent framework should I use?
For production deployments with substantial state management and audit-trail requirements, LangGraph is the dominant default in 2026. For role-based multi-agent prototypes, CrewAI provides faster initial development. For Microsoft-centric environments, Semantic Kernel and AutoGen fit the broader ecosystem. For straightforward single-agent applications using a specific foundation model, the vendor’s official agent SDK (OpenAI Agents SDK, Anthropic Claude Agent SDK) is often the cleanest starting point. The “best” framework depends on the team, the use case, and the production requirements.
How much does it cost to run an AI agent?
Agent costs scale with the number of tokens consumed per task, the model tier used, and the duration of agent runs. A simple agent using a mid-tier model might cost a few cents per task. A complex agent running for hours on frontier models can cost dollars to tens of dollars per task. The economics depend heavily on the specific workload; the patterns that reduce cost (using cheaper models for routine steps, caching, batching, careful prompt design) can drop the total cost by 50–80% versus naive implementations. Cost engineering is part of agent deployment.
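A worked example makes the arithmetic concrete. All numbers below are assumptions chosen for illustration, not real vendor prices: a 20-step run, 10,000 input and 500 output tokens per step, and two hypothetical price tiers quoted in USD per million tokens.

```python
def step_cost(input_tokens: int, output_tokens: int,
              input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD of one model call, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical price tiers (USD per million tokens), not real vendor pricing
FRONTIER = {"input": 10.0, "output": 30.0}
CHEAP = {"input": 1.0, "output": 3.0}

STEPS, IN_TOK, OUT_TOK = 20, 10_000, 500

# Naive: every step runs on the frontier model.
naive = STEPS * step_cost(IN_TOK, OUT_TOK, FRONTIER["input"], FRONTIER["output"])

# Routed: 16 routine steps on the cheap model, 4 hard steps on the frontier model.
routed = (16 * step_cost(IN_TOK, OUT_TOK, CHEAP["input"], CHEAP["output"])
          + 4 * step_cost(IN_TOK, OUT_TOK, FRONTIER["input"], FRONTIER["output"]))

saving = 1 - routed / naive
print(f"naive:  ${naive:.2f}")    # $2.30
print(f"routed: ${routed:.3f}")   # $0.644
print(f"saving: {saving:.0%}")    # 72%
```

Under these assumptions, routing alone lands inside the 50–80% range mentioned above; caching and batching would change the input-token term rather than the per-step model choice.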
Are AI agents going to replace knowledge workers?
The honest answer is “some categories of work, partially, over time.” Specific tasks that consist of bounded research, structured output, and recoverable errors face the most direct displacement pressure. Tasks combining those elements with judgment, relationship-building, or high-stakes decision-making are more likely to be augmented than replaced. The 2026 reality is closer to “agents do the bounded work; humans do the judgment work” than to “agents replace knowledge workers wholesale.” The shape will continue to evolve.








