How to Build a Chatbot with LangChain and Node.js

Building a chatbot that remembers conversation context, retrieves relevant information from your documents, and maintains coherent multi-turn conversations requires orchestrating several components: language models, memory management, document retrieval, and prompt engineering. LangChain abstracts this complexity into reusable patterns, though the abstraction introduces its own learning curve and occasional frustrations when the framework's opinions don't match your requirements.

This guide builds a production-capable chatbot from scratch using LangChain.js and Node.js. You'll implement conversation memory that persists across sessions, document-based question answering that retrieves context from your knowledge base, and streaming responses for better user experience. The focus is on patterns that scale beyond prototypes—handling edge cases, managing costs, and structuring code for maintainability.

We assume familiarity with Node.js and basic REST API concepts but explain LangChain-specific patterns as we build.

Understanding LangChain's Core Abstractions

LangChain organizes LLM applications around several core concepts that map to common chatbot requirements.

Models represent the language model interface. LangChain supports multiple providers (OpenAI, Anthropic, Cohere) through a unified interface, letting you switch models without rewriting application logic. ChatOpenAI, ChatAnthropic, and similar classes wrap provider-specific APIs.

Chains combine models with prompts and logic into reusable sequences. A simple chain might format a prompt, call the model, and parse the response. Complex chains orchestrate multiple steps—retrieving documents, then summarizing them, then answering a question based on the summary.

Memory manages conversation state. Without memory, each query is independent. With memory, the chatbot maintains context across turns, remembering what the user said three messages ago. Different memory types offer tradeoffs between context size and relevance.

Retrievers fetch relevant documents from knowledge bases. When a user asks a question, the retriever finds related content from your documentation, past conversations, or data stores. This retrieved context then informs the model's response.

These abstractions work together: a retriever fetches relevant documents, memory provides conversation history, a chain orchestrates calling the model with both inputs, and the model generates a response. Understanding this flow matters more than memorizing individual API methods.

Initial Setup and Dependencies

Start by creating a new Node.js project and installing required dependencies:

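npm init -y
npm pkg set type=module
npm install langchain @langchain/core @langchain/openai @langchain/anthropic
npm install dotenv express rate-limiter-flexible

The examples use ES module import syntax, so package.json needs "type": "module" (the npm pkg command above sets it). LangChain.js splits into modular packages. The @langchain/core package provides the base abstractions, langchain provides chains and memory, and provider-specific packages (@langchain/openai, @langchain/anthropic) contain model implementations. The rate-limiter-flexible package is used later for rate limiting. This modular structure means larger bundle sizes if you import everything, but better tree-shaking for production builds.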

Configure environment variables for API access:

# .env
OPENAI_API_KEY=sk-...your-key...
PORT=3000

Basic project structure:

chatbot/
├── src/
│   ├── bot.js          // Chatbot logic
│   ├── server.js       // Express server
│   └── memory.js       // Memory management
├── .env
└── package.json

Building a Basic Conversational Chain

The simplest chatbot uses a conversational chain that maintains message history:

// src/bot.js
import { ChatOpenAI } from '@langchain/openai';
import { ConversationChain } from 'langchain/chains';
import { BufferMemory } from 'langchain/memory';

export class SimpleChatbot {
  constructor() {
    this.model = new ChatOpenAI({
      modelName: 'gpt-3.5-turbo',
      temperature: 0.7
    });

    this.memory = new BufferMemory({
      returnMessages: true,
      memoryKey: 'history'
    });

    this.chain = new ConversationChain({
      llm: this.model,
      memory: this.memory
    });
  }

  async chat(message) {
    const response = await this.chain.call({
      input: message
    });

    return response.response;
  }
}

BufferMemory stores all messages in memory. This works for short conversations but fails when conversation length exceeds the model's context window. The memory keeps accumulating messages until you either hit token limits or run out of RAM.

The ConversationChain automatically formats prompts with history, calls the model, and updates memory. This convenience hides details—useful for prototypes but limiting when you need custom behavior.

Implementing Conversation Memory Strategies

Production chatbots need memory management that handles long conversations without exceeding context limits or costs.

BufferWindowMemory keeps only the last N messages:

import { BufferWindowMemory } from 'langchain/memory';

const memory = new BufferWindowMemory({
  k: 5, // Keep last 5 exchanges (10 messages total)
  returnMessages: true,
  memoryKey: 'history'
});

This prevents unbounded growth but loses older context. With k set to 5, anything said more than five exchanges ago is no longer in the prompt, so a user referring back to it won't get a relevant response.
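
To make the windowing behavior concrete, here is a minimal plain-JavaScript sketch of the same idea. The lastKExchanges helper is hypothetical and only mimics the effect of a window of k exchanges; it is not how LangChain implements it.

```javascript
// Keep only the most recent k exchanges (one exchange = user + assistant message).
function lastKExchanges(messages, k) {
  if (k <= 0) return []; // slice(-0) would return everything, so guard explicitly
  return messages.slice(-2 * k);
}

// Build a 7-exchange conversation: 14 messages in total.
const conversation = [];
for (let i = 1; i <= 7; i++) {
  conversation.push({ role: 'user', content: `question ${i}` });
  conversation.push({ role: 'assistant', content: `answer ${i}` });
}

const windowed = lastKExchanges(conversation, 5);
console.log(windowed.length);     // 10
console.log(windowed[0].content); // "question 3"
```

The first two exchanges have dropped out of the window, which is exactly the context loss described above.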

ConversationSummaryMemory condenses the conversation into a running summary instead of dropping old messages:

import { ConversationSummaryMemory } from 'langchain/memory';

const memory = new ConversationSummaryMemory({
  llm: new ChatOpenAI({ modelName: 'gpt-3.5-turbo' }),
  returnMessages: true,
  memoryKey: 'history'
});

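After each exchange, ConversationSummaryMemory asks the LLM to fold the new messages into a running summary, and that summary replaces the raw history in the prompt. If you want recent messages kept verbatim alongside a summary of older ones, use ConversationSummaryBufferMemory, which summarizes only once the buffer exceeds a token limit. Either way, this maintains context while controlling token usage. The tradeoff: summarization requires additional LLM calls, increasing costs and latency.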
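
The "summarize older messages, keep recent ones verbatim" strategy can be sketched independently of LangChain. In this illustration compactHistory and toySummarize are hypothetical names; a real summarizer would call an LLM rather than concatenate strings.

```javascript
// Sketch of a summary-plus-recent-messages strategy.
// `summarize` is a hypothetical callback; in practice it would call an LLM.
function compactHistory(state, maxRecent, summarize) {
  const { summary, messages } = state;
  if (messages.length <= maxRecent) return state;

  const cut = messages.length - maxRecent;
  const overflow = messages.slice(0, cut); // older messages to fold into the summary
  const recent = messages.slice(cut);      // newest messages stay verbatim
  return { summary: summarize(summary, overflow), messages: recent };
}

// Toy summarizer: joins short notes instead of calling a model.
const toySummarize = (previous, overflow) =>
  [previous, ...overflow.map((m) => `${m.role}: ${m.content}`)]
    .filter(Boolean)
    .join(' | ');

const state = {
  summary: '',
  messages: [
    { role: 'user', content: 'I need a plan for 20 seats' },
    { role: 'assistant', content: 'The Team plan fits' },
    { role: 'user', content: 'Does it include SSO?' },
    { role: 'assistant', content: 'Yes' },
  ],
};

const compacted = compactHistory(state, 2, toySummarize);
console.log(compacted.summary);
// "user: I need a plan for 20 seats | assistant: The Team plan fits"
console.log(compacted.messages.length); // 2
```

Each call to the summarizer corresponds to one extra LLM request in a real system, which is where the added cost and latency come from.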

VectorStoreRetrieverMemory stores all messages in a vector database and retrieves the most relevant ones:

import { VectorStoreRetrieverMemory } from 'langchain/memory';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { OpenAIEmbeddings } from '@langchain/openai';

const vectorStore = new MemoryVectorStore(
  new OpenAIEmbeddings()
);

const memory = new VectorStoreRetrieverMemory({
  vectorStoreRetriever: vectorStore.asRetriever(5), // Retrieve the 5 most similar entries
  memoryKey: 'history'
  // Retrieved exchanges are injected into the prompt as text
});

This approach finds semantically relevant past messages regardless of position in conversation history. A user asking "what was the price we discussed?" retrieves the pricing conversation even if it occurred 100 messages ago. The cost: embedding every message (small but accumulating) and vector search overhead.
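
The retrieval step behind this kind of memory is a nearest-neighbor search over embedding vectors. A minimal sketch follows, assuming cosine similarity as the metric and using hand-made three-dimensional vectors in place of real embeddings; the topK helper and the sample messages are invented for illustration.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored messages most similar to the query vector.
function topK(queryVector, entries, k) {
  return entries
    .map((e) => ({ ...e, score: cosineSimilarity(queryVector, e.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hand-made "embeddings" with dimensions [pricing, shipping, greeting].
const pastMessages = [
  { text: 'The Pro plan is billed monthly', vector: [0.9, 0.1, 0.0] },
  { text: 'We ship worldwide', vector: [0.1, 0.9, 0.0] },
  { text: 'Hello there', vector: [0.0, 0.1, 0.9] },
];

// Stands in for the embedding of "what was the price we discussed?"
const queryVector = [1.0, 0.0, 0.1];
const relevant = topK(queryVector, pastMessages, 1);
console.log(relevant[0].text); // "The Pro plan is billed monthly"
```

Position in the conversation plays no role here: only vector similarity decides what comes back.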

Memory Strategy Selection: BufferWindowMemory for most use cases, ConversationSummaryMemory when older context frequently matters, VectorStoreRetrieverMemory when conversations span hundreds of messages with non-linear topic switches. Start simple, and add complexity only when conversation quality suffers.

Adding Document-Based Question Answering

Chatbots often need to answer questions about your specific content—documentation, policies, product information. Retrieval-Augmented Generation (RAG) fetches relevant documents and includes them in the prompt.

First, prepare and embed your documents:

import { TextLoader } from 'langchain/document_loaders/fs/text';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { OpenAIEmbeddings } from '@langchain/openai';

async function prepareKnowledgeBase() {
  // Load documents
  const loader = new TextLoader('./docs/faq.txt');
  const docs = await loader.load();

  // Split into chunks
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200
  });
  const splitDocs = await splitter.splitDocuments(docs);

  // Create vector store
  const vectorStore = await MemoryVectorStore.fromDocuments(
    splitDocs,
    new OpenAIEmbeddings()
  );

  return vectorStore;
}

The RecursiveCharacterTextSplitter divides documents into chunks small enough to fit in context windows. chunkSize determines maximum chunk length in characters, not tokens. For English text a token averages roughly four characters, so a 1000-character chunk is on the order of 250 tokens. chunkOverlap creates overlap between chunks so information spanning chunk boundaries doesn't get fragmented.
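
To see how chunkSize and chunkOverlap interact, here is a simplified fixed-size chunker. It is a sketch only: unlike RecursiveCharacterTextSplitter it ignores separators such as paragraphs and sentences, and chunkText and estimateTokens are hypothetical helpers. The four-characters-per-token figure is a rough heuristic for English text.

```javascript
// Fixed-size character chunking with overlap. Simplified: no separator awareness.
function chunkText(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const chunks = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// Rough heuristic for English text: about 4 characters per token.
const estimateTokens = (text) => Math.ceil(text.length / 4);

const sample = 'abcdefghijklmnopqrstuvwxyz'; // 26 characters
const chunks = chunkText(sample, 10, 4);
console.log(chunks);
// [ 'abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz' ]
console.log(estimateTokens('x'.repeat(1000))); // 250
```

Each chunk repeats the last four characters of the previous one, so text that straddles a boundary appears whole in at least one chunk.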

Integrate retrieval into the chatbot:

import { ChatOpenAI } from '@langchain/openai';
import { RetrievalQAChain } from 'langchain/chains';

export class DocumentChatbot {
  constructor(vectorStore) {
    this.model = new ChatOpenAI({
      modelName: 'gpt-3.5-turbo'
    });

    this.chain = RetrievalQAChain.fromLLM(
      this.model,
      vectorStore.asRetriever(3) // Retrieve top 3 documents
    );
  }

  async ask(question) {
    const response = await this.chain.call({
      query: question
    });

    return response.text;
  }
}

When a user asks a question, the retriever finds the 3 most relevant document chunks by semantic similarity. These chunks become part of the prompt: "Given these documents: [chunks], answer: [question]." This grounds responses in your actual content rather than the model's training data.

The number of retrieved chunks trades off context quality versus token cost. More chunks provide more context but consume more tokens and can introduce irrelevant information that confuses the model. Test with your specific documents to find the optimal number—usually 3-5 chunks.
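
A quick back-of-the-envelope calculation shows how the chunk count drives prompt size. This sketch assumes 1000-character chunks and the rough heuristic of four characters per token; estimateContextTokens is a hypothetical helper.

```javascript
// Estimate prompt tokens contributed by retrieved chunks.
// Assumes ~4 characters per token, a rough heuristic for English text.
function estimateContextTokens(chunkCount, chunkSizeChars, charsPerToken = 4) {
  return Math.ceil((chunkCount * chunkSizeChars) / charsPerToken);
}

for (const k of [3, 5, 10]) {
  console.log(`k=${k}: ~${estimateContextTokens(k, 1000)} tokens of context`);
}
// k=3: ~750 tokens of context
// k=5: ~1250 tokens of context
// k=10: ~2500 tokens of context
```

These tokens are paid on every query, in addition to the question, the system prompt, and any conversation history.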

Implementing Streaming Responses

Streaming displays partial responses as they generate, dramatically improving perceived performance. LangChain supports streaming through callbacks:

import { ChatOpenAI } from '@langchain/openai';
import { HumanMessage } from '@langchain/core/messages';

export class StreamingChatbot {
  constructor() {
    this.model = new ChatOpenAI({
      modelName: 'gpt-3.5-turbo',
      streaming: true
    });
  }

  async chat(message, onToken) {
    // invoke() takes message objects; callbacks are passed as call options
    const response = await this.model.invoke(
      [new HumanMessage(message)],
      {
        callbacks: [{
          handleLLMNewToken(token) {
            onToken(token); // Called for each token
          }
        }]
      }
    );

    return response.content;
  }
}

The handleLLMNewToken callback fires for each token as it arrives. Integrate with Server-Sent Events (SSE) for browser clients:

// src/server.js
import 'dotenv/config'; // Load OPENAI_API_KEY and PORT from .env
import express from 'express';
import { StreamingChatbot } from './bot.js';

const app = express();
app.use(express.json()); // Parse JSON bodies so req.body is populated
const bot = new StreamingChatbot();

app.post('/chat/stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const { message } = req.body;

  try {
    // Forward each token to the client as its own SSE event
    await bot.chat(message, (token) => {
      res.write(`data: ${JSON.stringify({ token })}\n\n`);
    });

    res.write('data: [DONE]\n\n');
    res.end();
  } catch (error) {
    res.write(`data: ${JSON.stringify({ error: error.message })}\n\n`);
    res.end();
  }
});

app.listen(process.env.PORT || 3000);

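Client-side consumption needs fetch rather than EventSource, because EventSource only issues GET requests and cannot send a request body:

async function streamChat(message, onToken) {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) return;
    buffer += decoder.decode(value, { stream: true });

    // Events end with a blank line; keep any partial event in the buffer
    const events = buffer.split('\n\n');
    buffer = events.pop();
    for (const event of events) {
      const data = event.replace(/^data: /, '');
      if (data === '[DONE]') return;
      const payload = JSON.parse(data);
      if (payload.token) onToken(payload.token); // Append token to display
    }
  }
}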
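
Whichever client API is used, SSE payloads can arrive split across network chunks, so a client that reads the raw stream has to buffer partial events. A minimal, dependency-free sketch of an incremental parser follows; createSseParser is a hypothetical helper that handles only the data: field used by the server above.

```javascript
// Incremental SSE parser: feed it text chunks, get back complete `data:` payloads.
// Events are separated by a blank line; partial events stay buffered.
function createSseParser() {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    const frames = buffer.split('\n\n');
    buffer = frames.pop(); // last piece may be an incomplete event
    return frames
      .map((frame) =>
        frame
          .split('\n')
          .filter((line) => line.startsWith('data: '))
          .map((line) => line.slice('data: '.length))
          .join('\n')
      )
      .filter((payload) => payload.length > 0);
  };
}

const feed = createSseParser();
// The second event is cut in half by the chunk boundary.
const first = feed('data: {"token":"Hel"}\n\ndata: {"tok');
const second = feed('en":"lo"}\n\ndata: [DONE]\n\n');
console.log(first);  // [ '{"token":"Hel"}' ]
console.log(second); // [ '{"token":"lo"}', '[DONE]' ]
```

The half-received event is held back until its terminating blank line arrives, so JSON.parse never sees a truncated payload.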

Combining Memory and Retrieval

Production chatbots typically need both conversation memory and document retrieval. LangChain's ConversationalRetrievalQAChain combines these:

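import { ChatOpenAI } from '@langchain/openai';
import { ConversationalRetrievalQAChain } from 'langchain/chains';
import { BufferMemory } from 'langchain/memory';

export class AdvancedChatbot {
  constructor(vectorStore) {
    this.model = new ChatOpenAI({
      modelName: 'gpt-3.5-turbo',
      temperature: 0.7
    });

    // With returnSourceDocuments the chain returns several output keys,
    // so the memory must be told which input and output to store
    this.memory = new BufferMemory({
      memoryKey: 'chat_history',
      inputKey: 'question',
      outputKey: 'text',
      returnMessages: true
    });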

    this.chain = ConversationalRetrievalQAChain.fromLLM(
      this.model,
      vectorStore.asRetriever(3),
      {
        memory: this.memory,
        returnSourceDocuments: true
      }
    );
  }

  async chat(message) {
    const response = await this.chain.call({
      question: message
    });

    return {
      answer: response.text,
      sources: response.sourceDocuments.map(doc => ({
        content: doc.pageContent,
        metadata: doc.metadata
      }))
    };
  }
}

This chain maintains conversation history while retrieving relevant documents for each query. The returnSourceDocuments option includes the actual chunks used, letting you show citations to users—important for trust and fact-checking.

The chain internally reformulates the user's question based on chat history before retrieving documents. A user asking "What about pricing?" after previously asking about features gets their question rephrased to "What about pricing for [product mentioned in history]?" This improves retrieval relevance for multi-turn conversations.
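
The reformulation step amounts to one extra prompt sent to the model before retrieval. The sketch below shows the general shape of such a prompt; the wording and the buildCondensePrompt helper are illustrative assumptions, not LangChain's built-in template.

```javascript
// Builds the text of a "condense question" prompt.
// The wording is illustrative, not LangChain's built-in template.
function buildCondensePrompt(chatHistory, followUp) {
  const transcript = chatHistory
    .map((m) => `${m.role === 'user' ? 'Human' : 'Assistant'}: ${m.content}`)
    .join('\n');
  return [
    'Given the conversation below, rewrite the follow-up question',
    'so it can be understood without the conversation.',
    '',
    transcript,
    '',
    `Follow-up question: ${followUp}`,
    'Standalone question:',
  ].join('\n');
}

const condensePrompt = buildCondensePrompt(
  [
    { role: 'user', content: 'What features does the Pro plan have?' },
    { role: 'assistant', content: 'It includes SSO and audit logs.' },
  ],
  'What about pricing?'
);
console.log(condensePrompt);
```

The model's completion of this prompt, rather than the user's literal words, is what gets embedded and sent to the retriever.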

Cost Warning: ConversationalRetrievalQAChain makes up to two LLM calls per query: one to reformulate the question based on history, and one to generate the answer. Monitor token usage carefully, because costs accumulate faster than with simple chains.
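
A rough per-turn estimate makes the two-call overhead visible. In this sketch the token counts and prices are placeholders chosen for easy arithmetic, not real provider rates, and estimateTurnCost is a hypothetical helper; substitute your provider's current pricing.

```javascript
// Per-turn cost estimate for a two-call chain (reformulate + answer).
// Prices are parameters: the numbers used below are placeholders, not real rates.
function estimateTurnCost({ calls, pricePer1kInput, pricePer1kOutput }) {
  return calls.reduce(
    (total, { inputTokens, outputTokens }) =>
      total +
      (inputTokens / 1000) * pricePer1kInput +
      (outputTokens / 1000) * pricePer1kOutput,
    0
  );
}

const cost = estimateTurnCost({
  pricePer1kInput: 0.001,  // placeholder price per 1,000 input tokens
  pricePer1kOutput: 0.002, // placeholder price per 1,000 output tokens
  calls: [
    { inputTokens: 500, outputTokens: 50 },   // question reformulation
    { inputTokens: 1500, outputTokens: 300 }, // answer with retrieved chunks
  ],
});
console.log(cost.toFixed(4)); // "0.0027"
```

Note that the reformulation call's input grows with the conversation history, so its share of the cost rises over a long session.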

Persisting Conversation Memory

In-memory storage loses conversation history when the server restarts. Production applications need persistent storage:

// src/memory.js
import { BufferMemory, ChatMessageHistory } from 'langchain/memory';
import { HumanMessage, AIMessage } from '@langchain/core/messages';
import fs from 'fs/promises';

export class PersistentMemory {
  constructor(sessionId) {
    // Reject IDs that could escape the sessions directory (for example "../")
    if (!/^[\w-]+$/.test(sessionId)) {
      throw new Error('Invalid session ID');
    }
    this.sessionId = sessionId;
    this.filePath = `./sessions/${sessionId}.json`;
  }

  async load() {
    try {
      const data = await fs.readFile(this.filePath, 'utf-8');
      const messages = JSON.parse(data);

      const history = new ChatMessageHistory(
        messages.map(msg =>
          msg.type === 'human'
            ? new HumanMessage(msg.content)
            : new AIMessage(msg.content)
        )
      );

      return new BufferMemory({
        chatHistory: history,
        returnMessages: true,
        memoryKey: 'history'
      });
    } catch {
      // No existing session
      return new BufferMemory({
        returnMessages: true,
        memoryKey: 'history'
      });
    }
  }

  async save(memory) {
    const messages = await memory.chatHistory.getMessages();
    const serialized = messages.map(msg => ({
      type: msg._getType(),
      content: msg.content
    }));

    await fs.mkdir('./sessions', { recursive: true });
    await fs.writeFile(
      this.filePath,
      JSON.stringify(serialized, null, 2)
    );
  }
}

Integration with the chatbot:

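import { ChatOpenAI } from '@langchain/openai';
import { ConversationChain } from 'langchain/chains';
import { PersistentMemory } from './memory.js';

export class PersistentChatbot {
  constructor() {
    this.model = new ChatOpenAI({ modelName: 'gpt-3.5-turbo' });
  }

  async chat(sessionId, message) {
    // Load this session's history, or start fresh if none exists
    const persistentMemory = new PersistentMemory(sessionId);
    const memory = await persistentMemory.load();

    const chain = new ConversationChain({
      llm: this.model,
      memory
    });

    const response = await chain.call({ input: message });

    // Write the updated history back before responding
    await persistentMemory.save(memory);

    return response.response;
  }
}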

For production, replace file-based storage with Redis, PostgreSQL, or MongoDB. LangChain provides integrations for various stores, though implementing custom persistence often gives you more control over serialization and retrieval logic.
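
When swapping storage backends, it helps to keep validation and serialization separate from the store itself. The sketch below is one possible shape, not a LangChain integration: InMemorySessionStore, assertValidSessionId, and the 64-character ID limit are all assumptions made for illustration. A Redis or SQL adapter would expose the same load and save methods (asynchronously).

```javascript
// Storage-agnostic helpers: validate session IDs and round-trip messages as JSON.
const SESSION_ID_PATTERN = /^[A-Za-z0-9_-]{1,64}$/;

function assertValidSessionId(sessionId) {
  if (typeof sessionId !== 'string' || !SESSION_ID_PATTERN.test(sessionId)) {
    throw new Error('Invalid session ID');
  }
  return sessionId;
}

// Persist only the fields needed to rebuild the history.
const serializeMessages = (messages) =>
  JSON.stringify(messages.map(({ type, content }) => ({ type, content })));
const deserializeMessages = (json) => JSON.parse(json);

// Hypothetical in-memory store with the same load/save shape as a database adapter.
class InMemorySessionStore {
  constructor() {
    this.sessions = new Map();
  }

  save(sessionId, messages) {
    this.sessions.set(assertValidSessionId(sessionId), serializeMessages(messages));
  }

  load(sessionId) {
    const raw = this.sessions.get(assertValidSessionId(sessionId));
    return raw ? deserializeMessages(raw) : [];
  }
}

const store = new InMemorySessionStore();
store.save('user-42', [
  { type: 'human', content: 'Hi', extra: 'dropped on save' },
  { type: 'ai', content: 'Hello!' },
]);
console.log(store.load('user-42'));
// [ { type: 'human', content: 'Hi' }, { type: 'ai', content: 'Hello!' } ]
```

Validating the ID at the store boundary means every backend gets the same protection, whether the ID ends up in a file path, a cache key, or a query parameter.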

Implementing Custom Prompt Templates

Default prompts work for basic use cases, but customization improves response quality for specific domains:

import { ChatPromptTemplate, MessagesPlaceholder } from '@langchain/core/prompts';

const customPrompt = ChatPromptTemplate.fromMessages([
  [
    'system',
    `You are a technical support chatbot for a SaaS product.
    Rules:
    - Provide concise, actionable answers
    - If you don't know something, say so clearly
    - Reference documentation when available
    - Use technical terminology appropriate for developers`
  ],
  new MessagesPlaceholder('history'),
  ['human', '{input}']
]);

const chain = new ConversationChain({
  llm: this.model,
  memory: this.memory,
  prompt: customPrompt
});

The system message sets behavioral guidelines. MessagesPlaceholder injects conversation history. The human message contains the current query. This structure gives you precise control over how context is presented to the model.
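
Conceptually, the template expands into an ordered list of messages. The following hand-rolled equivalent shows that ordering; renderPrompt is a hypothetical helper, not part of LangChain.

```javascript
// Hand-rolled equivalent of a system + history placeholder + human template.
function renderPrompt(systemText, history, input) {
  return [
    { role: 'system', content: systemText },
    ...history,                       // where the history placeholder injects messages
    { role: 'user', content: input }, // the '{input}' slot
  ];
}

const rendered = renderPrompt(
  'You are a technical support chatbot for a SaaS product.',
  [
    { role: 'user', content: 'My webhook is failing' },
    { role: 'assistant', content: 'Which status code do you see?' },
  ],
  'A 401'
);
console.log(rendered.map((m) => m.role));
// [ 'system', 'user', 'assistant', 'user' ]
```

Because the system message always comes first and the current query always comes last, the history can grow or shrink without disturbing the instructions.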

For retrieval chains, customize how retrieved documents are formatted:

const qaPrompt = ChatPromptTemplate.fromMessages([
  [
    'system',
    `Answer based on the following documentation excerpts:
    {context}

    Conversation so far:
    {chat_history}

    If the documentation doesn't contain relevant information, say so.
    Include source references in your answer.`
  ],
  // The chain passes chat_history to this prompt as formatted text, not message objects
  ['human', '{question}']
]);

const chain = ConversationalRetrievalQAChain.fromLLM(
  this.model,
  vectorStore.asRetriever(),
  {
    memory: this.memory,
    qaChainOptions: {
      prompt: qaPrompt
    }
  }
);

Error Handling and Resilience

LangChain operations can fail in multiple ways: LLM API errors, retrieval failures, memory serialization issues. Robust error handling prevents cascading failures:

export class ResilientChatbot {
  async chat(message, retries = 3) {
    for (let attempt = 0; attempt < retries; attempt++) {
      try {
        const response = await this.chain.call({
          input: message
        });

        return {
          success: true,
          response: response.response
        };
      } catch (error) {
        console.error(`Attempt ${attempt + 1} failed:`, error);

        if (error.status === 429) {
          // Rate limit - wait and retry
          await new Promise(r => setTimeout(r, 2 ** attempt * 1000));
          continue;
        }

        if (error.status >= 500) {
          // Server error - retry
          await new Promise(r => setTimeout(r, 1000));
          continue;
        }

        // Client error - don't retry
        return {
          success: false,
          error: 'Unable to process your request. Please try again.'
        };
      }
    }

    return {
      success: false,
      error: 'Service temporarily unavailable.'
    };
  }
}
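
When many clients hit a rate limit at once, fixed exponential delays make them all retry at the same moment. One common refinement is to add random jitter to the delay. The sketch below assumes the "full jitter" variant; backoffDelay, isRetryable, and the 30-second cap are illustrative choices, and the random source is injectable so the behavior can be tested deterministically.

```javascript
// Exponential backoff with "full jitter":
// a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Decide whether an error is worth retrying based on its HTTP status, if any.
function isRetryable(error) {
  return error.status === 429 || (typeof error.status === 'number' && error.status >= 500);
}

// With a fixed "random" value of 0.5 the delays are easy to inspect.
const half = () => 0.5;
console.log(backoffDelay(0, { random: half }));  // 500
console.log(backoffDelay(3, { random: half }));  // 4000
console.log(backoffDelay(10, { random: half })); // 15000 (capped at 30000 before jitter)
```

The cap keeps late attempts from waiting minutes, and the jitter spreads retries out so a recovering API is not hit by a synchronized burst.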

Implement circuit breakers for retrieval to prevent repeated failures from overwhelming your vector store:

class CircuitBreaker {
  constructor(failureThreshold = 5, resetTimeout = 60000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.resetTimeout;
    }
  }
}

Production Deployment Patterns

Production chatbots require session management, rate limiting, and monitoring:

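// src/server.js
import express from 'express';
import { RateLimiterMemory } from 'rate-limiter-flexible';
import { AdvancedChatbot } from './bot.js';

const app = express();
app.use(express.json()); // Parse JSON bodies so req.body is populated

// vectorStore is the store returned by prepareKnowledgeBase(), shown earlier.
// One chatbot per session, so users never share conversation memory.
// In production, evict idle sessions or load memory from persistent storage.
const chatbots = new Map();
function getChatbot(sessionId) {
  if (!chatbots.has(sessionId)) {
    chatbots.set(sessionId, new AdvancedChatbot(vectorStore));
  }
  return chatbots.get(sessionId);
}

const rateLimiter = new RateLimiterMemory({
  points: 10, // Number of requests
  duration: 60 // Per 60 seconds
});

app.post('/chat', async (req, res) => {
  const { sessionId, message } = req.body;
  if (typeof sessionId !== 'string' || typeof message !== 'string') {
    return res.status(400).json({ error: 'sessionId and message are required.' });
  }

  try {
    await rateLimiter.consume(sessionId);
  } catch {
    return res.status(429).json({
      error: 'Too many requests. Please wait.'
    });
  }

  try {
    const startTime = Date.now();
    const response = await getChatbot(sessionId).chat(message);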
    const duration = Date.now() - startTime;

    // Log for monitoring
    console.log({
      sessionId,
      duration,
      messageLength: message.length,
      responseLength: response.answer.length
    });

    res.json(response);
  } catch (error) {
    console.error('Chat error:', error);
    res.status(500).json({
      error: 'Failed to process message'
    });
  }
});

Monitor token usage and costs by tracking actual LLM calls:

class MonitoredChatbot {
  constructor(vectorStore) {
    this.model = new ChatOpenAI({
      modelName: 'gpt-3.5-turbo',
      callbacks: [{
        handleLLMEnd(output) {
          const usage = output.llmOutput?.tokenUsage;
          if (usage) {
            console.log('Token usage:', {
              prompt: usage.promptTokens,
              completion: usage.completionTokens,
              total: usage.totalTokens
            });
          }
        }
      }]
    });

    // Rest of initialization
  }
}
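
Logging each call is a start, but per-session totals are what let you spot expensive conversations or enforce a budget. A minimal sketch of an accumulator follows; UsageTracker is a hypothetical class, and the field names mirror the promptTokens and completionTokens values logged above.

```javascript
// Accumulates token usage per session so totals can be logged or capped.
class UsageTracker {
  constructor() {
    this.totals = new Map();
  }

  // Add one LLM call's usage to the session's running totals.
  record(sessionId, { promptTokens = 0, completionTokens = 0 } = {}) {
    const current =
      this.totals.get(sessionId) ?? { promptTokens: 0, completionTokens: 0, calls: 0 };
    const updated = {
      promptTokens: current.promptTokens + promptTokens,
      completionTokens: current.completionTokens + completionTokens,
      calls: current.calls + 1,
    };
    this.totals.set(sessionId, updated);
    return updated;
  }

  totalTokens(sessionId) {
    const t = this.totals.get(sessionId);
    return t ? t.promptTokens + t.completionTokens : 0;
  }
}

const tracker = new UsageTracker();
tracker.record('session-a', { promptTokens: 500, completionTokens: 50 });
tracker.record('session-a', { promptTokens: 1500, completionTokens: 300 });
console.log(tracker.totalTokens('session-a')); // 2350
```

A check such as comparing totalTokens against a per-session limit before each call is one way to implement graceful degradation when a conversation becomes too costly.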

Production Checklist: Session-based memory persistence, per-user rate limiting, token usage monitoring, error tracking with structured logs, circuit breakers for external dependencies, graceful degradation when LLM is unavailable, and response time tracking for performance optimization.

Frequently Asked Questions

How do I handle conversations that exceed the context window?

Use ConversationSummaryMemory to condense old messages or VectorStoreRetrieverMemory to retrieve only relevant past exchanges. Alternatively, implement automatic conversation summarization every N messages, storing the summary and clearing old messages. Test different strategies with your typical conversation patterns. Customer support benefits from different approaches than creative writing assistants.

Can I use LangChain with Anthropic Claude or Google Gemini instead of OpenAI?

Yes. Import ChatAnthropic (from @langchain/anthropic) or ChatGoogleGenerativeAI (from @langchain/google-genai, installed separately) and use them in place of ChatOpenAI. The abstractions work across providers, though provider-specific features (like Claude's extended context) require provider-specific configuration. Switching providers typically requires changing only the model initialization, not the chain logic.

How do I prevent the chatbot from making up information not in my documents?

Set temperature to 0 or near 0 for more deterministic responses. Customize prompts to explicitly instruct the model to only use provided context. Include phrases like "If the information isn't in the documentation, say so explicitly." Monitor responses for hallucinations and fine-tune prompts based on failure patterns.

What's the difference between RetrievalQAChain and ConversationalRetrievalQAChain?

RetrievalQAChain answers single questions without conversation history. ConversationalRetrievalQAChain maintains memory across turns and reformulates questions based on history. Use RetrievalQAChain for independent queries (search-style interactions). Use ConversationalRetrievalQAChain for back-and-forth conversations where context matters.

How many document chunks should I retrieve per query?

Start with 3-5 chunks. More chunks provide more context but increase token costs and can introduce noise. Test retrieval quality by examining which chunks get returned for typical queries. If relevant information frequently appears in lower-ranked chunks, increase the count. If top chunks consistently contain the answer, reduce the count to save tokens.

Should I use LangChain or build custom LLM integration?

Use LangChain for prototypes and standard use cases (conversational agents, document Q&A). Build custom integration when you need fine-grained control over prompts, custom memory strategies, or integration with specialized infrastructure. LangChain's abstractions save time initially but can become constraints for complex custom requirements.

How do I debug LangChain chains when responses aren't what I expect?

Enable verbose mode on chains to see intermediate steps. Set verbose: true in chain configuration to log prompts and responses at each step. Use callbacks to intercept prompts before they're sent to the model—this reveals how LangChain formats your inputs. Test components in isolation (retriever alone, memory alone) before debugging integrated chains.

Can I deploy LangChain chatbots on serverless platforms?

Yes, with caveats. Serverless cold starts can add noticeable latency to first requests, often on the order of seconds depending on bundle size and platform. For better UX, use provisioned concurrency or keep functions warm with periodic pings. Vector store retrieval works better with managed services (Pinecone, Weaviate) than trying to bundle embeddings in the function. Session persistence requires external storage since serverless functions are stateless.

Conclusion

Building production chatbots with LangChain and Node.js requires understanding the framework's memory abstractions, retrieval patterns, and chain compositions. BufferWindowMemory handles most conversation memory needs without complex management. ConversationalRetrievalQAChain combines document retrieval with conversation context for knowledge-based chatbots. Streaming responses through Server-Sent Events creates responsive user experiences despite LLM generation latency.

The framework accelerates development for standard patterns but introduces complexity when your requirements diverge from LangChain's opinions. Start with built-in chains and memory types, then customize prompts before implementing entirely custom logic. Monitor token usage from the beginning—LangChain's abstractions can hide expensive operations like automatic question reformulation or conversation summarization that dramatically increase costs at scale.
