How to Build an AI Agent with Tool Use

AI agents that can only generate text are limited. The real power emerges when agents can interact with external systems: querying databases, calling APIs, executing code, or manipulating files. Tool use turns a language model from a text generator into a system that can complete multi-step tasks. Many implementations struggle, however, because developers treat tool calling as a simple feature addition rather than an architectural decision that affects reliability, cost, and control flow.

This guide walks through building production-ready AI agents with tool use capabilities. You'll learn how to design tool schemas that models can reliably invoke, implement execution sandboxes that prevent runaway costs and security issues, handle multi-step reasoning loops where agents chain multiple tool calls, and debug the failure modes that only appear when your agent operates autonomously. The patterns here come from building agents that process thousands of tool calls daily in production environments.

We'll cover the fundamental architecture patterns, tool definition strategies, execution safety mechanisms, and the specific edge cases you'll encounter when your agent needs to decide which tool to call, parse the results, and determine next actions.

Understanding Tool Use Architecture

Tool use in AI agents follows a request-response loop that extends the standard LLM completion cycle. The model generates a completion that includes structured tool call requests. Your application parses these requests, executes the specified functions with provided arguments, and returns results to the model. The model then continues reasoning with access to the tool outputs.

The critical architectural decision: synchronous vs asynchronous tool execution. In synchronous execution, your application blocks while tools run and immediately returns results to the model. This works for fast tools (API calls under 2 seconds, database queries) but creates problems when tools take longer. Asynchronous execution queues tool requests, returns immediately, and notifies the agent when results are ready—but this complicates the conversation state management because you need to pause and resume agent context.

Most production systems use a hybrid approach: synchronous execution for fast, deterministic tools and asynchronous execution for long-running operations like web scraping, video processing, or complex calculations. A common cutoff is around 5 seconds: tools that reliably finish within that window run synchronously to maintain conversation flow, while slower tools run asynchronously with a callback mechanism.

Key Insight: The model has no concept of execution time. It will request a tool that takes 30 minutes as casually as one that takes 30 milliseconds. Your architecture must handle timeout boundaries, partial results, and graceful degradation when tools fail—the model cannot do this for you.

Tool Call Protocol Standards

OpenAI, Anthropic, and open-source models use different protocols for tool calling. OpenAI's function calling returns the call as structured JSON in the API response, with the arguments encoded as a JSON string. Anthropic's Claude returns "tool_use" content blocks whose input is already a parsed JSON object. Open-source models often rely on prompt engineering with specific formatting conventions. This fragmentation means you need an abstraction layer if you want to switch models or support multiple providers.

// OpenAI function calling format (legacy function_call field; newer API versions return a tool_calls array)
{
  "role": "assistant",
  "content": null,
  "function_call": {
    "name": "get_weather",
    "arguments": "{\"location\": \"San Francisco\", \"unit\": \"celsius\"}"
  }
}

// Anthropic tool use format
{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01A09q90qw90lq917835lq9",
      "name": "get_weather",
      "input": {
        "location": "San Francisco",
        "unit": "celsius"
      }
    }
  ]
}

// Abstraction layer interface
interface ToolCall {
  id: string;
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCalls(response: LLMResponse): ToolCall[] {
  // Provider-specific parsing logic
  if (response.provider === 'openai') {
    return parseOpenAITools(response);
  } else if (response.provider === 'anthropic') {
    return parseAnthropicTools(response);
  }
  // Fallback to prompt-based extraction for other models
  return parseFromText(response.content);
}

Building this abstraction layer early prevents vendor lock-in and simplifies testing. You can mock tool calls with a consistent interface regardless of which model your agent uses in production.
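The provider-specific parsers referenced above could look roughly like the following. This is a minimal sketch, not either vendor's SDK: the OpenAIMessage and AnthropicMessage shapes are assumptions modeled on the two JSON examples shown earlier, and the synthesized call ID for the legacy OpenAI format is a hypothetical convention.

```typescript
// Normalized shape, mirroring the ToolCall interface above.
interface ToolCall {
  id: string;
  name: string;
  arguments: Record<string, unknown>;
}

// Assumed response shapes, based only on the two example payloads.
interface OpenAIMessage {
  role: "assistant";
  content: string | null;
  function_call?: { name: string; arguments: string };
}

type AnthropicBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };

interface AnthropicMessage {
  role: "assistant";
  content: AnthropicBlock[];
}

function parseOpenAITools(message: OpenAIMessage): ToolCall[] {
  const call = message.function_call;
  if (!call) return [];

  // Arguments arrive as a JSON string and may be malformed.
  let args: Record<string, unknown> = {};
  try {
    const parsed: unknown = JSON.parse(call.arguments);
    if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
      args = parsed as Record<string, unknown>;
    }
  } catch {
    // Leave args empty so downstream validation can report what is missing.
  }

  // The legacy format carries one call and no ID, so synthesize one (hypothetical scheme).
  return [{ id: `call_${call.name}`, name: call.name, arguments: args }];
}

function parseAnthropicTools(message: AnthropicMessage): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const block of message.content) {
    // Text blocks can be interleaved with tool_use blocks; keep only the calls.
    if (block.type === "tool_use") {
      calls.push({ id: block.id, name: block.name, arguments: block.input });
    }
  }
  return calls;
}
```

Both parsers return the same ToolCall shape, so the rest of the agent loop does not need to know which provider produced the response.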

Designing Reliable Tool Schemas

The tool schema is what the model sees when deciding which tool to call. A poorly designed schema leads to incorrect tool selection, malformed arguments, and agents that get stuck in retry loops. The schema must be precise enough for the model to understand capabilities but flexible enough to handle natural language variations in how users express requests.

Schema Design Principles

Every tool needs a name, description, and parameter specification. The description is more important than developers realize—models use it for tool selection. A description like "gets weather" is ambiguous. Does it return current weather or forecast? For which location formats? What happens if the location is invalid?

// Weak tool definition
const weatherTool = {
  name: "get_weather",
  description: "Gets weather information",
  parameters: {
    type: "object",
    properties: {
      location: { type: "string" }
    }
  }
};

// Strong tool definition
const weatherTool = {
  name: "get_current_weather",
  description: "Returns current weather conditions for a specific location. Use this when the user asks about current weather, not forecasts. Accepts city names, addresses, or coordinates. Returns temperature, conditions, humidity, and wind speed.",
  parameters: {
    type: "object",
    properties: {
      location: {
        type: "string",
        description: "City name (e.g., 'San Francisco'), full address, or coordinates in 'lat,lng' format"
      },
      unit: {
        type: "string",
        enum: ["celsius", "fahrenheit"],
        description: "Temperature unit. Defaults to celsius.",
        default: "celsius"
      }
    },
    required: ["location"]
  }
};

// Tool implementation with validation
async function getCurrentWeather(location: string, unit: string = "celsius") {
  // Input validation - models can generate invalid inputs
  if (!location || location.trim().length === 0) {
    return {
      error: "Location is required and cannot be empty",
      success: false
    };
  }

  // Handle different location formats
  const coords = parseCoordinates(location) || await geocodeLocation(location);

  if (!coords) {
    return {
      error: `Could not find location: ${location}. Please provide a valid city name, address, or coordinates.`,
      success: false
    };
  }

  // Call weather API
  const weather = await weatherAPI.getCurrent(coords.lat, coords.lng, unit);

  return {
    location: location,
    temperature: weather.temp,
    conditions: weather.conditions,
    humidity: weather.humidity,
    wind_speed: weather.windSpeed,
    unit: unit,
    success: true
  };
}

The detailed description guides tool selection. The parameter descriptions with examples reduce malformed inputs. The structured error responses give the model actionable information when calls fail—it can retry with corrected parameters rather than getting stuck.

Parameter Validation Strategies

Models will generate invalid parameters. They'll pass strings where numbers are expected, omit required fields, or hallucinate parameter names that don't exist. Your tool implementation must validate inputs and return structured errors that help the model correct its mistakes.

Validation Type	Common Failures	Handling Strategy
Type Validation	String instead of number, array instead of object	Attempt coercion, return error with expected type if coercion fails
Required Fields	Missing parameters that schema marks as required	Return error listing all missing required fields
Enum Values	Values outside allowed set, case mismatches	Case-insensitive matching, fuzzy match suggestion for typos
Format Validation	Invalid emails, malformed URLs, incorrect date formats	Return error with format example that model can follow
Range Constraints	Numbers outside min/max, strings exceeding length limits	Clamp to range with warning, or error if clamping would change semantics

The validation layer serves dual purposes: it protects your system from malformed inputs and it teaches the model how to call your tools correctly through structured feedback.
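Several of the strategies in the table can be combined in one small validator. The sketch below is an illustration under stated assumptions: ParamSpec, ValidationResult, and validateArgs are hypothetical names, and it only covers string and number parameters. It reports every problem in a single response so the model can fix them all in one retry.

```typescript
type ParamSpec =
  | { type: "string"; enum?: string[]; required?: boolean }
  | { type: "number"; min?: number; max?: number; required?: boolean };

type ValidationResult =
  | { success: true; value: Record<string, string | number> }
  | { success: false; errors: string[] };

function validateArgs(
  schema: Record<string, ParamSpec>,
  args: Record<string, unknown>
): ValidationResult {
  const errors: string[] = [];
  const value: Record<string, string | number> = {};

  // Hallucinated parameter names: list the valid ones in the error.
  for (const key of Object.keys(args)) {
    if (!(key in schema)) {
      errors.push(`Unknown parameter '${key}'. Valid parameters: ${Object.keys(schema).join(", ")}`);
    }
  }

  for (const [name, spec] of Object.entries(schema)) {
    const raw = args[name];
    if (raw === undefined || raw === null) {
      if (spec.required) errors.push(`Missing required parameter '${name}'`);
      continue;
    }

    if (spec.type === "number") {
      // Attempt coercion of numeric strings such as "3" before rejecting.
      const num = typeof raw === "string" && raw.trim() !== "" ? Number(raw) : raw;
      if (typeof num !== "number" || !Number.isFinite(num)) {
        errors.push(`Parameter '${name}' must be a number, got ${JSON.stringify(raw)}`);
      } else if ((spec.min !== undefined && num < spec.min) || (spec.max !== undefined && num > spec.max)) {
        errors.push(`Parameter '${name}' must be between ${spec.min ?? "-Infinity"} and ${spec.max ?? "Infinity"}`);
      } else {
        value[name] = num;
      }
      continue;
    }

    if (typeof raw !== "string") {
      errors.push(`Parameter '${name}' must be a string, got ${JSON.stringify(raw)}`);
      continue;
    }
    if (spec.enum) {
      // Case-insensitive match, normalized to the canonical spelling.
      const match = spec.enum.find(option => option.toLowerCase() === raw.toLowerCase());
      if (match === undefined) {
        errors.push(`Parameter '${name}' must be one of: ${spec.enum.join(", ")}`);
      } else {
        value[name] = match;
      }
    } else {
      value[name] = raw;
    }
  }

  if (errors.length > 0) return { success: false, errors };
  return { success: true, value };
}
```

Returning the coerced values rather than the raw arguments means the tool body only ever sees inputs in the shape the schema promises.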

Implementing the Agent Execution Loop

The agent execution loop orchestrates the conversation between the model and your tools. A basic loop follows this pattern: send user request to model, check if model wants to call tools, execute requested tools, return results to model, repeat until model provides a final response. The complexity emerges in handling edge cases—what happens when a tool fails, when the agent requests the same tool repeatedly, or when execution exceeds time or cost budgets?

interface AgentConfig {
  maxIterations: number;  // Prevent infinite loops
  maxToolCalls: number;   // Limit total tool invocations
  timeout: number;        // Overall execution timeout
  costLimit: number;      // Maximum API cost per request
}

class AgentExecutor {
  private model: LLMClient;
  private tools: Map<string, Tool>;
  private config: AgentConfig;

  // AgentResult: { content, iterations, toolCallCount, cost }
  async execute(userMessage: string): Promise<AgentResult> {
    const conversation: Message[] = [
      { role: "user", content: userMessage }
    ];

    let iteration = 0;
    let totalToolCalls = 0;
    let totalCost = 0;
    const startTime = Date.now();

    while (iteration < this.config.maxIterations) {
      // Check timeouts and budgets before each iteration
      if (Date.now() - startTime > this.config.timeout) {
        throw new AgentError("Execution timeout exceeded");
      }

      if (totalCost > this.config.costLimit) {
        throw new AgentError("Cost limit exceeded");
      }

      // Get model response
      const response = await this.model.complete(conversation, {
        tools: Array.from(this.tools.values()).map(t => t.schema)
      });

      totalCost += response.cost;

      // Check if model wants to use tools
      const toolCalls = this.parseToolCalls(response);

      if (toolCalls.length === 0) {
        // Model provided final answer
        return {
          content: response.content,
          iterations: iteration,
          toolCallCount: totalToolCalls,
          cost: totalCost
        };
      }

      // Check tool call budget before executing, so the limit is never exceeded
      if (totalToolCalls + toolCalls.length > this.config.maxToolCalls) {
        throw new AgentError("Tool call limit exceeded");
      }

      // Execute tools in parallel where possible
      const toolResults = await this.executeTools(toolCalls);
      totalToolCalls += toolCalls.length;

      // Add tool results to conversation
      conversation.push({
        role: "assistant",
        content: response.content,
        toolCalls: toolCalls
      });

      conversation.push({
        role: "tool",
        content: toolResults
      });

      iteration++;
    }

    throw new AgentError("Max iterations exceeded without reaching conclusion");
  }

  private async executeTools(toolCalls: ToolCall[]): Promise<ToolResult[]> {
    const results = await Promise.allSettled(
      toolCalls.map(async (call) => {
        const tool = this.tools.get(call.name);

        if (!tool) {
          return {
            id: call.id,
            error: `Tool '${call.name}' not found. Available tools: ${Array.from(this.tools.keys()).join(", ")}`,
            success: false
          };
        }

        try {
          const result = await tool.execute(call.arguments);
          return {
            id: call.id,
            name: call.name,
            result: result,
            success: true
          };
        } catch (error) {
          // Thrown values are not guaranteed to be Error instances
          return {
            id: call.id,
            name: call.name,
            error: error instanceof Error ? error.message : String(error),
            success: false
          };
        }
      })
    );

    // Convert Promise results to ToolResult format
    return results.map((r, i) => {
      if (r.status === "fulfilled") {
        return r.value;
      } else {
        return {
          id: toolCalls[i].id,
          error: `Tool execution failed: ${r.reason}`,
          success: false
        };
      }
    });
  }
}

This implementation handles the most common failure modes: infinite loops through iteration limits, runaway costs through budget checks, and tool execution failures through structured error handling. The parallel tool execution improves performance when the agent requests multiple independent tools.

Warning: Without iteration and cost limits, a misbehaving agent can exhaust your API budget in minutes. One production incident saw an agent stuck in a retry loop that burned through $800 in 15 minutes before circuit breakers stopped it. Always implement hard limits with alerting.

Handling Tool Execution Failures

Tool failures fall into three categories: invalid inputs (model error), execution failures (tool error), and availability failures (infrastructure error). Your agent needs different strategies for each.

For invalid inputs, return structured errors that explain what's wrong and how to fix it. The model can often correct its mistakes when given clear feedback. For execution failures, decide whether retry makes sense—network timeouts should retry, but "record not found" errors should not. For availability failures, implement circuit breakers that fail fast after detecting tool degradation rather than wasting tokens on repeated failures.

class ToolExecutor {
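  private circuitBreakers = new Map<string, CircuitBreaker>();

  async executeTool(tool: Tool, args: any): Promise<ToolResult> {
    // Create a breaker on first use so the lookup can never be undefined
    // (assumes CircuitBreaker has a default constructor)
    let breaker = this.circuitBreakers.get(tool.name);
    if (!breaker) {
      breaker = new CircuitBreaker();
      this.circuitBreakers.set(tool.name, breaker);
    }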

    // Check if tool is currently failing
    if (breaker.isOpen()) {
      return {
        error: `Tool '${tool.name}' is temporarily unavailable due to repeated failures. Try again in ${breaker.resetTime()} seconds.`,
        success: false,
        retryable: false
      };
    }

    try {
      const result = await tool.execute(args);
      breaker.recordSuccess();
      return { result, success: true };
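    } catch (error) {
      // Thrown values are not guaranteed to be Error instances
      const err = error instanceof Error ? error : new Error(String(error));

      // Validation and not-found errors reflect bad input, not tool health,
      // so they should not push the breaker toward opening
      if (err.name !== "ValidationError" && err.name !== "NotFoundError") {
        breaker.recordFailure();
      }

      // Determine if error is retryable
      const retryable = this.isRetryableError(err);

      return {
        error: err.message,
        success: false,
        retryable: retryable,
        errorType: err.constructor.name
      };
    }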
  }

  private isRetryableError(error: Error): boolean {
    // Network errors - retry
    if (error.name === "NetworkError" || error.name === "TimeoutError") {
      return true;
    }

    // Rate limits - retry with backoff
    if (error.name === "RateLimitError") {
      return true;
    }

    // Validation errors - don't retry, model needs to fix input
    if (error.name === "ValidationError") {
      return false;
    }

    // Not found errors - don't retry, data doesn't exist
    if (error.name === "NotFoundError") {
      return false;
    }

    // Default: don't retry unknown errors
    return false;
  }
}

Circuit breakers prevent cascading failures. When a tool starts failing repeatedly, the breaker opens and fails fast without attempting execution. This saves tokens, reduces latency, and prevents overwhelming failing services with requests.
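The breaker object used by ToolExecutor is not defined above. One possible shape is sketched here, assuming a consecutive-failure threshold and a fixed cooldown. The default values and the injectable clock parameter are illustrative choices, not part of the original design.

```typescript
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 30_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: allow a trial call (a simplified half-open state).
      // One more failure will reopen the breaker immediately.
      this.openedAt = null;
      this.consecutiveFailures = this.failureThreshold - 1;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }

  // Seconds until another attempt is allowed (0 when the breaker is closed).
  resetTime(): number {
    if (this.openedAt === null) return 0;
    const remainingMs = this.cooldownMs - (this.now() - this.openedAt);
    return Math.max(0, Math.ceil(remainingMs / 1000));
  }
}
```

Passing the clock in as a function keeps the breaker testable without real waiting; production code can rely on the Date.now default.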

Building Secure Tool Execution Sandboxes

Tool execution security is non-negotiable. Your agent will eventually generate a tool call that attempts to read sensitive files, make unauthorized API requests, or execute destructive operations. The model has no concept of security boundaries—it will confidently request tools that would compromise your system if executed naively.

Principle of Least Privilege

Each tool should run with the minimum permissions needed for its function. A tool that reads configuration files shouldn't have write access. A tool that queries a database shouldn't have delete permissions. A tool that calls external APIs shouldn't have access to internal network resources.

// Tool permission system
interface ToolPermissions {
  fileSystem?: {
    readPaths: string[];   // Allowed read paths
    writePaths: string[];  // Allowed write paths
  };
  network?: {
    allowedDomains: string[];  // Whitelist of domains
    allowedPorts: number[];    // Allowed ports
  };
  database?: {
    allowedOperations: ("read" | "write" | "delete")[];
    allowedTables: string[];
  };
}

class SandboxedTool {
  private permissions: ToolPermissions;

  async execute(args: any): Promise<unknown> {
    // Validate all operations against permissions before execution
    this.validatePermissions(args);

    // Execute in isolated context
    return await this.executeInSandbox(args);
  }

  private validatePermissions(args: any): void {
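    // Example: validating file system access
    if (args.path && this.permissions.fileSystem) {
      const normalizedPath = path.resolve(args.path);

      // Match on a directory boundary: a bare startsWith() would let
      // "/data/reports-private" pass for an allowed "/data/reports"
      const allowed = this.permissions.fileSystem.readPaths.some(allowedPath => {
        const root = path.resolve(allowedPath);
        const rootWithSep = root.endsWith(path.sep) ? root : root + path.sep;
        return normalizedPath === root || normalizedPath.startsWith(rootWithSep);
      });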

      if (!allowed) {
        throw new SecurityError(
          `Access denied: path '${args.path}' is outside allowed directories`
        );
      }
    }

    // Example: validating network access
    if (args.url && this.permissions.network) {
      const domain = new URL(args.url).hostname;

      const allowed = this.permissions.network.allowedDomains.some(
        allowedDomain => domain === allowedDomain || domain.endsWith(`.${allowedDomain}`)
      );

      if (!allowed) {
        throw new SecurityError(
          `Access denied: domain '${domain}' is not in whitelist`
        );
      }
    }
  }
}

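This permission system mitigates common security issues: directory traversal attacks, SSRF vulnerabilities, and unauthorized data access. It is not complete on its own, since path checks do not resolve symlinks and a hostname allowlist does not stop redirects or DNS rebinding. The agent can request any operation, but the sandbox enforces boundaries before execution.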
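Path allowlists are easy to get subtly wrong, so the containment check deserves a closer look. The sketch below uses a hypothetical pure-string normalizer for absolute POSIX-style paths so it runs without any platform modules. In a Node.js tool you would more likely normalize with path.resolve and resolve symlinks with fs.realpath before comparing.

```typescript
// Collapse "." and ".." segments in an absolute POSIX-style path.
function normalizePosix(p: string): string {
  const out: string[] = [];
  for (const segment of p.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return "/" + out.join("/");
}

function isPathInside(allowedRoot: string, requested: string): boolean {
  const root = normalizePosix(allowedRoot);
  const target = normalizePosix(requested);
  if (root === "/") return true;
  // Require an exact match or a separator right after the root, so that
  // "/data/reports-private" is not treated as inside "/data/reports".
  return target === root || target.startsWith(root + "/");
}
```

The separator check is the important part: a plain prefix comparison accepts sibling directories whose names merely begin with the allowed root.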

Resource Limits and Rate Limiting

Beyond permissions, tools need resource constraints. A tool that fetches web pages shouldn't download gigabytes of data. A tool that processes files shouldn't consume all available memory. A tool that makes API calls shouldn't trigger rate limit penalties on your external accounts.

Resource Type	Limit Strategy	Enforcement Method
Execution Time	30 second timeout per tool call	Promise.race with timeout, abort controller
Memory Usage	100MB per tool execution	Process isolation, container memory limits
Network I/O	10MB download, 1MB upload per call	Stream size monitoring, abort on limit
API Call Rate	10 calls per minute per endpoint	Token bucket, sliding window counters
File System Operations	1000 operations per execution	Operation counter with quota enforcement

These limits prevent both malicious exploitation and accidental resource exhaustion. They should be configurable per tool based on its legitimate resource requirements, but defaults should be conservative.
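The API call rate row mentions token buckets. A minimal in-memory version might look like this, assuming a single process and a hypothetical TokenBucket class. A multi-instance deployment would need shared storage for the counters instead.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerMinute: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerMinute: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerMinute = refillPerMinute;
    this.now = now;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefill = now();
  }

  // Returns true and consumes one token if a call is allowed right now.
  tryAcquire(): boolean {
    const current = this.now();
    const elapsedMs = current - this.lastRefill;
    // Refill proportionally to elapsed time, never beyond capacity.
    const refill = (elapsedMs * this.refillPerMinute) / 60_000;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.lastRefill = current;

    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

A tool wrapper would call tryAcquire() before each outbound request and return a structured "rate limited" error to the model when it returns false, rather than forwarding the call.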

Advanced Tool Patterns

Composite Tools and Tool Chaining

Some operations require multiple steps that are better exposed as a single composite tool rather than forcing the agent to chain tools manually. For example, "search and summarize" combines web search with content extraction and summarization. Exposing this as one tool reduces token usage and improves reliability because the composite tool handles the orchestration logic that the agent might get wrong.

// Composite tool example
const searchAndSummarize = {
  name: "search_and_summarize",
  description: "Searches the web for a query, extracts content from top results, and returns a summary. Use this when you need to research a topic and provide a concise summary of findings.",
  parameters: {
    type: "object",
    properties: {
      query: {
        type: "string",
        description: "Search query"
      },
      maxResults: {
        type: "number",
        description: "Number of results to analyze (1-5)",
        default: 3
      }
    },
    required: ["query"]
  },

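  async execute({ query, maxResults = 3 }) {
    // Enforce the documented 1-5 range; the model may ignore it
    const limit = Math.min(5, Math.max(1, Math.floor(maxResults)));

    // Internal orchestration - agent doesn't see these steps
    const searchResults = await searchEngine.search(query, limit);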

    const contents = await Promise.all(
      searchResults.map(result => webScraper.extract(result.url))
    );

    const combinedContent = contents.join("\n\n");

    // Use a smaller, faster model for summarization
    const summary = await summaryModel.summarize(combinedContent, {
      maxLength: 500
    });

    return {
      query: query,
      sourcesAnalyzed: searchResults.length,
      sources: searchResults.map(r => ({ title: r.title, url: r.url })),
      summary: summary
    };
  }
};

Composite tools trade flexibility for reliability. The agent can't customize the intermediate steps, but it also can't make mistakes in the orchestration logic. Use composite tools for well-defined multi-step operations that you want to optimize and control.

Conditional Tool Availability

Not all tools should be available in all contexts. A tool that modifies a production database should only be available to agents with explicit write permissions. A tool that accesses user data should only be available when the agent has valid user context. Dynamic tool availability based on context reduces security risks and helps guide the agent toward appropriate actions.

class ContextAwareToolRegistry {
  private tools: Map<string, Tool>;

  getAvailableTools(context: ExecutionContext): Tool[] {
    return Array.from(this.tools.values()).filter(tool => {
      // Check permission requirements
      if (tool.requiredPermissions) {
        const hasPermissions = tool.requiredPermissions.every(
          perm => context.permissions.includes(perm)
        );
        if (!hasPermissions) return false;
      }

      // Check context requirements
      if (tool.requiresUserContext && !context.userId) {
        return false;
      }

      if (tool.requiresOrgContext && !context.organizationId) {
        return false;
      }

      // Check environment restrictions
      if (tool.allowedEnvironments) {
        if (!tool.allowedEnvironments.includes(context.environment)) {
          return false;
        }
      }

      return true;
    });
  }
}

This pattern prevents the agent from even seeing tools it shouldn't use, reducing the token count and eliminating a class of security issues where the agent attempts unauthorized operations.

Debugging and Observability

Agent debugging is fundamentally different from traditional application debugging. The agent's decision process is opaque—you see tool calls and results, but not why the agent chose those specific tools or how it interpreted results. Effective debugging requires comprehensive logging of the entire execution trace and structured analysis of agent behavior patterns.

Execution Trace Logging

Log every step of the agent execution loop: user input, model responses, tool calls with arguments, tool results, and final output. Structure these logs for analysis—unstructured logs are useless when debugging why an agent made unexpected tool choices across a multi-step execution.

interface ExecutionTrace {
  sessionId: string;
  timestamp: string;
  userId?: string;
  model: string;
  input: string;
  iterations: IterationTrace[];
  output: string;
  cost: number;
  duration: number;
  error?: string;
}

interface IterationTrace {
  iteration: number;
  modelInput: Message[];
  modelOutput: string;
  toolCalls: ToolCallTrace[];
  reasoning?: string;  // If model provides reasoning
  tokensUsed: number;
  latency: number;
}

interface ToolCallTrace {
  toolName: string;
  arguments: any;
  result: any;
  error?: string;
  duration: number;
  timestamp: string;
}

class AgentLogger {
  async logExecution(trace: ExecutionTrace): Promise<void> {
    // Store in structured format for analysis
    await this.storage.store({
      ...trace,
      // Add metadata for querying
      tags: this.extractTags(trace),
      toolsUsed: trace.iterations.flatMap(i =>
        i.toolCalls.map(t => t.toolName)
      ),
      success: !trace.error,
      // Searchable fields
      searchText: `${trace.input} ${trace.output}`
    });

    // Real-time alerting for anomalies
    this.checkAnomalies(trace);
  }

  private checkAnomalies(trace: ExecutionTrace): void {
    // Detect concerning patterns
    if (trace.iterations.length >= 8) {
      this.alert("High iteration count", trace);
    }

    if (trace.cost > 0.50) {
      this.alert("High cost execution", trace);
    }

    const failedTools = trace.iterations.flatMap(i =>
      i.toolCalls.filter(t => t.error)
    );

    if (failedTools.length >= 3) {
      this.alert("Multiple tool failures", trace);
    }
  }
}

This structured logging enables queries like "show me all executions where the search_web tool was called more than twice" or "find sessions where the agent gave up without calling the required tool." These queries are impractical with unstructured logs.

Common Failure Pattern Detection

Agents fail in predictable ways. They get stuck in loops calling the same tool repeatedly with identical arguments. They hallucinate tool names that don't exist. They provide arguments in the wrong format despite clear schema definitions. They ignore tool results and proceed as if the call succeeded. Detecting these patterns early helps you improve tool designs and add guardrails.
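The first pattern, repeated identical calls, is straightforward to detect mechanically. Here is a small sketch; LoopDetector, ToolCallRecord, and stableStringify are hypothetical names, and the default of two allowed repeats is an illustrative threshold rather than a recommendation from any library.

```typescript
interface ToolCallRecord {
  name: string;
  arguments: Record<string, unknown>;
}

// Serialize with sorted keys so {a:1,b:2} and {b:2,a:1} compare equal.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(item => stableStringify(item)).join(",")}]`;
  }
  if (typeof value === "object" && value !== null) {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map(key => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  const json = JSON.stringify(value);
  return json === undefined ? "null" : json;
}

class LoopDetector {
  private readonly counts = new Map<string, number>();
  private readonly maxRepeats: number;

  constructor(maxRepeats = 2) {
    this.maxRepeats = maxRepeats;
  }

  // Records the call and returns true once it has been seen more than maxRepeats times.
  isLooping(call: ToolCallRecord): boolean {
    const key = `${call.name}:${stableStringify(call.arguments)}`;
    const count = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, count);
    return count > this.maxRepeats;
  }
}
```

Create one detector per execution and consult it before running each tool call. When isLooping returns true, return an error result that tells the model to try a different approach instead of executing the tool again.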

Pro Tip: Build a dashboard that shows tool success rates, average iterations per task type, and cost distribution across tools. Unexpected changes in these metrics often indicate agent behavior regressions or tool reliability issues before users report problems.

Production Deployment Considerations

Cost Management

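Tool-using agents are more expensive than simple completions because they require multiple model calls per user interaction. A task that takes 5 iterations makes 5 model calls, and because each call resends the growing conversation history (including earlier tool results), total token usage is usually more than 5x that of a single completion. Monitor per-user costs and implement budgets to prevent runaway spending.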

class CostTracker {
  private userBudgets: Map<string, UserBudget>;

  async checkBudget(userId: string, estimatedCost: number): Promise<boolean> {
    const budget = await this.getUserBudget(userId);

    // Check daily limit
    if (budget.todaySpend + estimatedCost > budget.dailyLimit) {
      throw new BudgetExceededError(
        `Daily budget exceeded. Used: $${budget.todaySpend}, Limit: $${budget.dailyLimit}`
      );
    }

    // Check monthly limit
    if (budget.monthSpend + estimatedCost > budget.monthlyLimit) {
      throw new BudgetExceededError(
        `Monthly budget exceeded. Used: $${budget.monthSpend}, Limit: $${budget.monthlyLimit}`
      );
    }

    return true;
  }

  async recordCost(userId: string, cost: number, metadata: CostMetadata): Promise<void> {
    // Update user budget tracking
    await this.storage.increment(`budget:${userId}:daily`, cost);
    await this.storage.increment(`budget:${userId}:monthly`, cost);

    // Store detailed breakdown for analysis
    await this.storage.store({
      userId,
      cost,
      timestamp: Date.now(),
      model: metadata.model,
      inputTokens: metadata.inputTokens,
      outputTokens: metadata.outputTokens,
      toolCalls: metadata.toolCalls,
      task: metadata.task
    });
  }
}

Budget enforcement prevents individual users from exhausting your API budget and provides data for cost optimization. You'll discover which tools are expensive, which tasks require excessive iterations, and where caching or tool optimization would have the biggest impact.
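When estimating costs for a budget check, it helps to remember that every model call resends the conversation so far, so input tokens grow faster than the iteration count. The following back-of-the-envelope sketch uses made-up numbers (a 1,000-token prompt and 500 tokens of tool traffic added per iteration) purely for illustration.

```typescript
// Total input tokens across all model calls in one agent run, assuming each
// iteration appends a fixed number of tokens (tool call plus result) to the history.
function totalInputTokens(
  promptTokens: number,
  tokensAddedPerIteration: number,
  iterations: number
): number {
  let total = 0;
  for (let i = 0; i < iterations; i++) {
    // Call i resends the original prompt plus everything appended so far.
    total += promptTokens + i * tokensAddedPerIteration;
  }
  return total;
}

const singleCall = totalInputTokens(1000, 500, 1); // 1000
const fiveIterations = totalInputTokens(1000, 500, 5); // 1000 + 1500 + 2000 + 2500 + 3000 = 10000
console.log(fiveIterations / singleCall); // 10
```

Under these assumptions a five-iteration task costs about 10x the input tokens of a single completion, which is why trimming or summarizing large tool results often has a bigger effect on cost than reducing the iteration count.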

Scaling Tool Execution

Tool execution can become a bottleneck. If your agent calls three tools and each takes 2 seconds, that's 6 seconds of latency when they run sequentially, or about 2 seconds if all three can run in parallel. Parallel execution helps when tools are independent, but many tools depend on results from previous calls. Optimize hot-path tools first: the 20% of tools that handle 80% of calls.

Optimization	Applies To	Expected Improvement
Result Caching	Deterministic tools called with repeated arguments	90%+ latency reduction for cache hits
Connection Pooling	Tools making database or API calls	30-50% latency reduction
Batch Operations	Tools called multiple times in sequence	50-70% latency reduction
Response Streaming	Tools returning large data sets	Perceived latency improvement, same total time
Precomputation	Tools performing expensive calculations	80%+ latency reduction for precomputable queries
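As an illustration of the result caching row, here is a small TTL cache keyed by tool name and arguments. ToolResultCache and cacheKey are hypothetical names; the key function only normalizes top-level key order, and the cache is only appropriate for tools whose output depends solely on their arguments.

```typescript
// Build a key that ignores top-level argument order.
function cacheKey(toolName: string, args: Record<string, unknown>): string {
  const sorted = Object.keys(args)
    .sort()
    .map(key => [key, args[key]]);
  return `${toolName}:${JSON.stringify(sorted)}`;
}

class ToolResultCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns a fresh cached value if present, otherwise computes and stores one.
  getOrCompute(toolName: string, args: Record<string, unknown>, compute: () => T): T {
    const key = cacheKey(toolName, args);
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = compute();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}
```

For asynchronous tools, T can be a Promise so that concurrent identical calls share one in-flight request. In that case you would also want to evict entries whose promise rejects, which this sketch does not do.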

Testing Tool-Using Agents

Traditional unit tests are insufficient for agents. You need integration tests that verify the agent selects correct tools, handles errors gracefully, and produces correct outputs across multi-step executions. The challenge: agents are non-deterministic. The same input can produce different tool call sequences that both arrive at correct answers.

Test Strategy

Focus on outcome verification rather than execution path verification. Test that the agent achieves the goal, not that it follows a specific sequence of tool calls. Mock external dependencies but keep the agent execution loop real to catch orchestration bugs.

describe("Agent Tool Use", () => {
  it("should retrieve and analyze data when asked for insights", async () => {
    // Setup mock tools
    const mockDatabase = createMockTool("query_database", {
      result: { users: 1500, revenue: 45000 }
    });

    const mockCalculator = createMockTool("calculate", {
      result: 30  // revenue per user
    });

    const agent = new AgentExecutor({
      tools: [mockDatabase, mockCalculator],
      model: testModel
    });

    // Execute agent
    const result = await agent.execute(
      "What's our revenue per user?"
    );

    // Verify outcome (not specific tool calls)
    expect(result.content).toContain("30");
    expect(result.content).toMatch(/revenue per user/i);

    // Verify required tools were called
    expect(mockDatabase.called).toBe(true);

    // Allow flexibility in execution path
    // Agent might use calculator tool or compute directly
  });

  it("should handle tool failures gracefully", async () => {
    const failingTool = createMockTool("query_database", {
      error: "Database connection timeout"
    });

    const agent = new AgentExecutor({
      tools: [failingTool],
      model: testModel
    });

    const result = await agent.execute(
      "Get user count from database"
    );

    // Agent should acknowledge failure in response
    expect(result.content).toMatch(/unable|error|failed/i);

    // Should not crash or loop indefinitely
    expect(result.iterations).toBeLessThan(5);
  });
});

This testing approach validates agent behavior without over-constraining implementation details. As you improve the agent or tools, tests remain valid as long as outcomes are correct.

FAQ

How do I prevent agents from calling the same tool repeatedly in a loop?

Implement loop detection by tracking tool call history within an execution. If the agent calls the same tool with identical or near-identical arguments more than twice, intervene by returning an error that prompts the agent to try a different approach. Set a maximum iteration count (typically 10-15) as a hard stop. Log these incidents—they often indicate tool schema problems or model limitations that you can address.

Should I use function calling or prompt-based tool use?

Use native function calling APIs (OpenAI functions, Claude tool use) when available. They're more reliable than prompt engineering because the model was fine-tuned for structured output. Prompt-based tool use works for models without native support but requires more error handling for malformed outputs. The abstraction layer pattern shown earlier lets you support both and switch between models without rewriting your agent logic.

How do I handle tools that take longer than a few seconds?

Implement async tool execution with task IDs and status polling. When a tool call is submitted, immediately return a task ID to the agent. Provide a separate "check_task_status" tool that the agent can call to see if results are ready. This prevents timeout issues and lets the agent work on other subtasks while waiting. For very long operations (minutes to hours), consider moving to a job queue system with webhook notifications.
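A minimal in-memory version of this task ID pattern might look like the following. TaskRegistry and TaskStatus are hypothetical names, and a real deployment would persist task state outside the process so it survives restarts.

```typescript
type TaskStatus =
  | { state: "pending" }
  | { state: "done"; result: unknown }
  | { state: "failed"; error: string }
  | { state: "unknown" };

class TaskRegistry {
  private readonly tasks = new Map<string, TaskStatus>();
  private nextId = 1;

  // Starts the work without awaiting it and returns a task ID immediately.
  submit(work: () => Promise<unknown>): string {
    const id = `task_${this.nextId++}`;
    this.tasks.set(id, { state: "pending" });

    void work().then(
      result => {
        this.tasks.set(id, { state: "done", result });
      },
      (err: unknown) => {
        const message = err instanceof Error ? err.message : String(err);
        this.tasks.set(id, { state: "failed", error: message });
      }
    );

    return id;
  }

  // Backing function for the "check_task_status" tool.
  checkStatus(id: string): TaskStatus {
    return this.tasks.get(id) ?? { state: "unknown" };
  }
}
```

The slow tool's handler returns the ID from submit() as its tool result, and the status tool simply forwards checkStatus(). Returning an explicit "unknown" state for unrecognized IDs gives the model something actionable if it hallucinates a task ID.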

What's the right number of tools to give an agent?

Start with fewer tools (5-10) focused on core functionality. As you add tools, monitor selection accuracy. When agents consistently choose the wrong tools or claim tools don't exist when they do, you've hit the model's tool selection limit. As a rough guide, current models handle 20-50 tools reasonably well, and selection quality degrades beyond that. Group related tools into composite tools or use hierarchical tool selection where agents first choose a category, then specific tools within that category.
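Hierarchical selection can be sketched as two small lookups over a tool catalog. The ToolSpec shape and the category field below are assumptions for illustration; the idea is that the first model turn sees only category names, and the second sees only the tools in the chosen category.

```typescript
// Hypothetical catalog entry; "category" is an assumed grouping field.
interface ToolSpec {
  name: string;
  category: string;
  description: string;
}

// Step 1: offer the agent only the list of categories.
function listCategories(tools: ToolSpec[]): string[] {
  return Array.from(new Set(tools.map((t) => t.category))).sort();
}

// Step 2: after the agent picks a category, expose only its tools.
function toolsInCategory(tools: ToolSpec[], category: string): ToolSpec[] {
  return tools.filter((t) => t.category === category);
}
```

This keeps the number of tool descriptions in any single request small, at the cost of one extra model turn per task.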

How do I debug why an agent chose the wrong tool?

Examine the tool descriptions and user input together. Models lean heavily on how closely the wording of the input matches the tool descriptions. If your "get_user_profile" tool description says "retrieves user data" but the user asked to "fetch account information," the model might not match them. Make descriptions comprehensive with synonyms and example use cases. Log the full tool list sent to the model, because sometimes the right tool wasn't available in that context.

Can I let agents call tools that modify data?

Yes, but implement approval workflows for destructive operations. Modification tools should return a preview of what will change and wait for explicit confirmation before executing. This confirmation can come from the user or from another tool that validates the safety of the operation. Never give agents direct delete or modify access to production data without safeguards—model hallucinations can generate plausible-looking but incorrect operations.
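The preview-then-confirm flow can be sketched as a two-phase gate. ApprovalGate and its method names are hypothetical; the key property is that propose never runs the operation, and confirm is called by the user-facing layer or a validator rather than by the agent itself.

```typescript
type ProposeResult = {
  status: "awaiting_confirmation";
  changeId: string;
  preview: string;
};

type ConfirmResult =
  | { status: "applied"; result: string }
  | { status: "not_found" };

class ApprovalGate {
  private pending = new Map<string, { preview: string; apply: () => string }>();
  private counter = 0;

  // Phase 1: store the operation and return a preview. Nothing executes yet.
  propose(preview: string, apply: () => string): ProposeResult {
    this.counter += 1;
    const changeId = `change_${this.counter}`;
    this.pending.set(changeId, { preview, apply });
    return { status: "awaiting_confirmation", changeId, preview };
  }

  // Phase 2: run the stored operation exactly once after explicit approval.
  confirm(changeId: string): ConfirmResult {
    const change = this.pending.get(changeId);
    if (!change) {
      return { status: "not_found" };
    }
    this.pending.delete(changeId);
    return { status: "applied", result: change.apply() };
  }

  // Discard a proposed change without running it.
  reject(changeId: string): boolean {
    return this.pending.delete(changeId);
  }
}
```

A modification tool would call propose and return the preview to the agent; the change only takes effect once something outside the agent calls confirm with the matching ID.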

How do I handle tool calls that need authentication?

Pass authentication context through the execution environment, not through tool arguments. The agent shouldn't see or manage API keys, tokens, or credentials. Your tool execution layer retrieves credentials from secure storage based on the user context. If a tool needs user-specific authentication (like accessing their GitHub repos), use OAuth flows where users grant permissions upfront, and tools use stored tokens transparently.
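As a sketch of keeping credentials out of the agent's view, the execution layer below looks up a token from the user context and passes it to the tool as a separate parameter. The interfaces (ExecutionContext, CredentialStore, AuthenticatedTool) are hypothetical names for illustration, and the store stands in for whatever secure storage you use.

```typescript
interface ExecutionContext {
  userId: string;
}

// Stand-in for secure storage (vault, encrypted database, and so on).
interface CredentialStore {
  getToken(userId: string, service: string): string | undefined;
}

type ToolResult = { ok: true; data: unknown } | { ok: false; error: string };

interface AuthenticatedTool {
  name: string;
  service: string; // which stored credential this tool needs
  run(args: Record<string, unknown>, token: string): ToolResult;
}

// The agent supplies only `args`; the token never appears in the model's context.
function executeWithCredentials(
  tool: AuthenticatedTool,
  args: Record<string, unknown>,
  ctx: ExecutionContext,
  store: CredentialStore,
): ToolResult {
  const token = store.getToken(ctx.userId, tool.service);
  if (token === undefined) {
    return {
      ok: false,
      error: `No ${tool.service} authorization on file. Ask the user to connect their account.`,
    };
  }
  return tool.run(args, token);
}
```

Tools should also avoid echoing the token in their results or error messages, since anything they return is sent back to the model.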

What's the best way to handle rate limits on external APIs called by tools?

Implement rate limiting at the tool level with exponential backoff. When a tool hits a rate limit, return an error with retry timing to the agent. Most agents handle "retry in X seconds" responses well. For aggressive rate limits, implement request queuing where tool calls enter a queue with rate limit awareness. The queue processes requests at the maximum safe rate and returns results to the appropriate agent execution context.
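A small sketch of the backoff calculation and the error shape returned to the agent. The base delay, cap, and field names are assumptions; production code would usually add random jitter and prefer the server's Retry-After value when one is provided.

```typescript
// Exponential backoff: base * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

interface RateLimitError {
  error: "rate_limited";
  retryAfterSeconds: number;
  message: string;
}

// Build the tool result the agent sees when a rate limit is hit.
// If the API reported its own retry delay, that value takes precedence.
function rateLimitResult(attempt: number, serverRetryAfterSeconds?: number): RateLimitError {
  const seconds = serverRetryAfterSeconds ?? Math.ceil(backoffDelayMs(attempt) / 1000);
  return {
    error: "rate_limited",
    retryAfterSeconds: seconds,
    message: `Rate limit reached. Retry in ${seconds} seconds.`,
  };
}
```

Returning the delay as both a number and a plain sentence gives the execution layer something to schedule on and the model something readable to reason about.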

Should tools return raw data or processed summaries?

Return structured data that balances completeness with token efficiency. For large datasets, return summaries and give the agent a way to fetch full records when needed, either through a "get_details" parameter or a separate detail tool. The initial call to a "search_database" tool might return 10 result titles and IDs, then the agent can call "get_record" for specific IDs it wants to analyze. This pattern reduces token usage while keeping information available.
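The "search_database" and "get_record" pair can be sketched over an in-memory array. The Doc shape and the substring matching are placeholders for a real database query; the point is the difference in what each tool returns.

```typescript
// Hypothetical record shape standing in for a database row.
interface Doc {
  id: string;
  title: string;
  body: string;
}

interface SearchSummary {
  totalMatches: number;
  results: { id: string; title: string }[];
}

// "search_database": return only IDs and titles, capped at `limit`.
function searchDatabase(records: Doc[], query: string, limit = 10): SearchSummary {
  const q = query.toLowerCase();
  const matches = records.filter(
    (r) => r.title.toLowerCase().includes(q) || r.body.toLowerCase().includes(q),
  );
  return {
    totalMatches: matches.length,
    results: matches.slice(0, limit).map((r) => ({ id: r.id, title: r.title })),
  };
}

// "get_record": return the full record for one ID the agent chose.
function getRecord(records: Doc[], id: string): Doc | { error: string } {
  const found = records.find((r) => r.id === id);
  return found ?? { error: `No record with id '${id}'. Use search_database to find valid IDs.` };
}
```

Including totalMatches lets the agent tell when the list was truncated and decide whether to refine the query.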

How do I test tools in isolation before giving them to agents?

Write standard unit tests for tool logic independent of agent execution. Create a test harness that calls tools with various argument combinations, including edge cases and invalid inputs. Verify error messages are clear and actionable—these messages teach the agent how to use your tools correctly. Test timeout behavior, rate limiting, and retry logic. Only after tools pass isolation testing should you expose them to agent execution.
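A table-driven harness is one way to exercise a tool against many argument combinations without involving the agent. The Tool and ToolCase types below are assumptions for illustration; the harness checks both success versus failure and whether the error message matches an expected pattern.

```typescript
type ToolOutcome = { ok: true; data: unknown } | { ok: false; error: string };
type Tool = (args: Record<string, unknown>) => ToolOutcome;

interface ToolCase {
  label: string;
  args: Record<string, unknown>;
  expectOk: boolean;
  errorPattern?: RegExp; // checked only when the tool reports a failure
}

// Run every case and return a list of human-readable failures (empty means pass).
function runToolCases(tool: Tool, cases: ToolCase[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    let outcome: ToolOutcome | undefined;
    try {
      outcome = tool(c.args);
    } catch (e) {
      // Tools should return structured errors, so a throw is itself a failure.
      failures.push(`${c.label}: tool threw ${String(e)}`);
    }
    if (!outcome) {
      continue;
    }
    if (outcome.ok !== c.expectOk) {
      failures.push(`${c.label}: expected ok=${c.expectOk}, got ok=${outcome.ok}`);
      continue;
    }
    if (!outcome.ok && c.errorPattern && !c.errorPattern.test(outcome.error)) {
      failures.push(`${c.label}: error message "${outcome.error}" is not actionable enough`);
    }
  }
  return failures;
}
```

Checking the error text matters because the agent only learns how to correct a bad call from what the message says.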

Conclusion

Building AI agents with reliable tool use requires treating tools as a first-class architectural concern, not a feature addition. The fundamental decisions—synchronous vs async execution, tool permission models, error handling strategies, and observability design—determine whether your agent handles edge cases gracefully or fails unpredictably under real-world conditions.

Start with a small set of well-defined tools with comprehensive schemas and strong validation. Implement execution limits and security boundaries before deploying to production. Build observability into every layer of the execution loop so you can debug behavior and optimize costs. As your agent proves reliable with core tools, expand capabilities incrementally while monitoring selection accuracy and execution patterns. The agents that succeed in production are those built with the assumption that every possible failure mode will eventually occur.
