How to Stream AI Responses in a React App
Non-streaming AI responses force users to stare at loading spinners for 10-30 seconds while waiting for complete answers. This creates the perception of slowness even when your backend is fast, drives users away before they see results, and wastes the progressive nature of LLM generation. Yet most React developers implement AI features with simple fetch calls that block until the entire response arrives, missing the opportunity to deliver text as it's generated.
This article explains how to implement streaming AI responses in React applications using patterns that handle network failures gracefully, update UI incrementally without performance issues, and work with popular AI APIs including OpenAI, Anthropic, and self-hosted models. You'll learn how to manage streaming state in React components, implement retry logic for interrupted streams, and optimize rendering performance when appending text rapidly.
We'll cover Server-Sent Events (SSE) for unidirectional streaming, ReadableStream processing with the Fetch API, React state management for streaming text, and error handling patterns that recover gracefully from network interruptions.
Why Streaming Matters for AI Interfaces
User experience research consistently shows that perceived performance matters more than actual performance. A 20-second response that arrives all at once feels slower than a 25-second response that starts appearing after 2 seconds. Streaming exploits this perceptual difference—users see progress immediately, can start reading while generation continues, and feel the interface is responsive even when total latency is higher.
The technical reality: LLMs generate text token by token. Hosted models such as Claude and GPT-4 commonly emit on the order of 50-100 tokens per second, though the exact rate varies with model, load, and prompt size. When you request a non-streaming response, the API buffers all tokens until generation completes, then sends the complete text. Streaming sends tokens as they are produced, eliminating the buffering wait. The total time to completion is similar, but the time until the user sees the first text drops from the full generation time (10-15 seconds for a long answer) to roughly 1-2 seconds.
This matters most for long responses. A 1000-token response takes 10-20 seconds to generate. Without streaming, users wait the entire duration seeing nothing. With streaming, they see the first sentence within 2 seconds and can start reading while the rest generates. The experience transforms from "is this broken?" to "this is actively working."
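The arithmetic behind these figures can be sketched in a few lines. The function name and the 1.5-second time-to-first-token value are illustrative assumptions, not measurements.

```typescript
// Rough latency model: how long until the user sees *something*?
interface LatencyEstimate {
  generationSeconds: number;   // time to produce every token
  firstVisibleSeconds: number; // time until any text is on screen
}

function estimateLatency(
  totalTokens: number,
  tokensPerSecond: number,
  timeToFirstTokenSeconds: number, // assumed value, measure your own
  streaming: boolean
): LatencyEstimate {
  const generationSeconds = totalTokens / tokensPerSecond;
  return {
    generationSeconds,
    // Without streaming, nothing is visible until generation has finished
    firstVisibleSeconds: streaming
      ? timeToFirstTokenSeconds
      : timeToFirstTokenSeconds + generationSeconds,
  };
}

// 1000 tokens at 50 tokens/second, assuming 1.5 s to the first token
const bufferedEstimate = estimateLatency(1000, 50, 1.5, false);
const streamedEstimate = estimateLatency(1000, 50, 1.5, true);
console.log(bufferedEstimate.firstVisibleSeconds); // 21.5
console.log(streamedEstimate.firstVisibleSeconds); // 1.5
```

Both requests finish at about the same moment; only the wait before the first visible text differs.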
The Cost of Not Streaming
Non-streaming implementations have predictable failure modes. Users close the tab or browser because they think the request hung. Mobile users on unreliable networks hit timeouts before responses complete. Long responses trigger gateway timeouts (typically 30-60 seconds) that terminate connections before the API finishes generating.
Streaming solves these problems by establishing connection health immediately. The first token arrives within seconds, confirming the request succeeded. Subsequent tokens arrive steadily, which keeps idle-timeout counters on proxies and gateways from expiring. Users see continuous progress, reducing abandonment. Network interruptions are detectable mid-stream, enabling retry logic that keeps the text already received instead of discarding everything.
Backend Streaming Setup: Server-Sent Events
Before implementing React streaming consumers, your backend must expose a streaming endpoint. Server-Sent Events (SSE) provide a simple protocol for server-to-client streaming. The server sends text chunks prefixed with "data: ", and the browser's EventSource API parses them automatically. For more control, use the Fetch API with ReadableStream, which we'll cover later.
Node.js Backend with Express
// backend/routes/chat.js
import express from 'express';
import Anthropic from '@anthropic-ai/sdk';
const router = express.Router();
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
// Assumes express.json() body parsing is enabled on the app
router.post('/chat/stream', async (req, res) => {
const { message, conversationHistory = [] } = req.body;
// Set SSE headers
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('Connection', 'keep-alive');
try {
const messages = [
...conversationHistory,
{ role: 'user', content: message }
];
const stream = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 4096,
messages,
stream: true
});
// Send chunks to client
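for await (const event of stream) {
// Only text deltas carry a text field; other delta types (such as tool input JSON) do not
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {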
res.write(`data: ${JSON.stringify({
type: 'chunk',
text: event.delta.text
})}\n\n`);
}
if (event.type === 'message_stop') {
res.write(`data: ${JSON.stringify({
type: 'done'
})}\n\n`);
}
}
res.end();
} catch (error) {
res.write(`data: ${JSON.stringify({
type: 'error',
message: error.message
})}\n\n`);
res.end();
}
});
export default router;
The SSE format is simple: each message starts with "data: ", followed by JSON, followed by two newlines. The browser parses this automatically when using EventSource, or you parse it manually when using Fetch. The format is human-readable, making debugging straightforward—you can test SSE endpoints with curl and see the raw stream.
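Parsing the format manually with Fetch might look like the sketch below. The names parseSSE and readSSE are hypothetical; the parser handles only data: lines, which is all the endpoint above emits, and ignores other SSE fields such as event: and id:.

```typescript
// Splits buffered SSE text into complete events, returning any trailing
// partial event so it can be prepended to the next network chunk.
interface ParsedSSE {
  events: string[]; // the joined "data:" payload of each complete event
  rest: string;     // incomplete tail, carried over to the next call
}

function parseSSE(buffer: string): ParsedSSE {
  // Events are separated by a blank line
  const blocks = buffer.replace(/\r\n/g, '\n').split('\n\n');
  const rest = blocks.pop() ?? '';
  const events: string[] = [];
  for (const block of blocks) {
    const dataLines = block
      .split('\n')
      .filter(line => line.startsWith('data:'))
      .map(line => line.slice(5).replace(/^ /, ''));
    if (dataLines.length > 0) events.push(dataLines.join('\n'));
  }
  return { events, rest };
}

// Reads a fetch Response body and passes each event's parsed JSON to onEvent.
async function readSSE(
  response: Response,
  onEvent: (payload: unknown) => void
): Promise<void> {
  if (!response.body) throw new Error('No response body');
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSSE(buffer);
    buffer = parsed.rest; // keep the partial event for the next chunk
    for (const data of parsed.events) onEvent(JSON.parse(data));
  }
}
```

Carrying the incomplete tail forward matters because network chunks do not line up with event boundaries: a single event can arrive split across two reads.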
Alternative: ReadableStream with Transform Stream
For more control over chunking or when integrating with edge runtimes (Vercel, Cloudflare Workers), use ReadableStream directly instead of SSE format. This approach works with the Fetch API's streaming body reader.
// Next.js API route with ReadableStream
import Anthropic from '@anthropic-ai/sdk';
export async function POST(req: Request) {
const { message } = await req.json();
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
const aiStream = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 2048,
messages: [{ role: 'user', content: message }],
stream: true
});
for await (const event of aiStream) {
// Only text deltas carry a text field
if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
// Send text chunk
controller.enqueue(
encoder.encode(event.delta.text)
);
}
}
controller.close();
} catch (error) {
controller.error(error);
}
}
});
return new Response(stream, {
headers: {
// The runtime applies chunked transfer itself; setting Transfer-Encoding by hand is unnecessary
'Content-Type': 'text/plain; charset=utf-8'
}
});
}
ReadableStream provides more flexibility than SSE. You control exactly what gets sent—raw text, JSON, or custom formats. This matters for edge runtimes that have limitations on response formats or when you need to transform the stream (filtering, compression, rate limiting) before sending to clients.
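As one example of transforming a stream before it reaches the client, the sketch below holds back incomplete words so the UI never renders half a word. The helper names are hypothetical, and it assumes a runtime that provides the standard TransformStream global (modern browsers, edge runtimes, Node 18+).

```typescript
// Splits text at its last whitespace character: everything up to and
// including that whitespace is ready to send, the remainder is held back.
function splitAtLastWhitespace(text: string): { ready: string; held: string } {
  const index = text.search(/\s\S*$/);
  if (index === -1) return { ready: '', held: text };
  return { ready: text.slice(0, index + 1), held: text.slice(index + 1) };
}

// Wraps the splitter in a TransformStream of strings.
function createWordBoundaryStream(): TransformStream<string, string> {
  let held = '';
  return new TransformStream<string, string>({
    transform(chunk, controller) {
      const result = splitAtLastWhitespace(held + chunk);
      held = result.held;
      if (result.ready) controller.enqueue(result.ready);
    },
    flush(controller) {
      // End of stream: emit whatever is still held back
      if (held) controller.enqueue(held);
    },
  });
}
```

A string stream can be piped through it with `textStream.pipeThrough(createWordBoundaryStream())`, then through `new TextEncoderStream()` to get bytes for the Response body.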
React Implementation: Consuming Streaming Responses
React components need to manage streaming state: accumulating text chunks, handling errors, and updating UI incrementally. The naive approach—updating state on every chunk—causes performance issues. Better patterns batch updates and optimize re-renders.
Basic Streaming Hook
// hooks/useStreamingChat.ts
import { useState, useCallback } from 'react';
interface Message {
role: 'user' | 'assistant';
content: string;
}
export function useStreamingChat() {
const [messages, setMessages] = useState<Message[]>([]);
const [isLoading, setIsLoading] = useState(false);
const [error, setError] = useState<string | null>(null);
const [streamingContent, setStreamingContent] = useState('');
const sendMessage = useCallback(async (content: string) => {
// Add user message immediately
const userMessage: Message = { role: 'user', content };
setMessages(prev => [...prev, userMessage]);
setIsLoading(true);
setError(null);
setStreamingContent('');
try {
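// This reader expects a plain-text body (as in the ReadableStream route above);
// an SSE endpoint would need its "data: " frames parsed first
const response = await fetch('/api/chat/stream', {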
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
message: content,
conversationHistory: messages
})
});
if (!response.ok) {
throw new Error('Stream request failed');
}
const reader = response.body?.getReader();
const decoder = new TextDecoder();
if (!reader) {
throw new Error('No response body');
}
let accumulatedContent = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Decode chunk
const chunk = decoder.decode(value, { stream: true });
accumulatedContent += chunk;
// Update streaming content
setStreamingContent(accumulatedContent);
}
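// Flush any bytes the decoder still holds from a split multi-byte character
accumulatedContent += decoder.decode();
// Move completed message to messages array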
setMessages(prev => [...prev, {
role: 'assistant',
content: accumulatedContent
}]);
setStreamingContent('');
} catch (err) {
setError(err instanceof Error ? err.message : 'Unknown error');
} finally {
setIsLoading(false);
}
}, [messages]);
return {
messages,
streamingContent,
isLoading,
error,
sendMessage
};
}
This hook manages the complete streaming lifecycle. It maintains a messages array for completed exchanges and a separate streamingContent string for the currently generating message. When streaming completes, the content moves from streamingContent to messages. The component using the hook still re-renders on every chunk, but the messages array keeps the same reference throughout streaming, so a memoized message list can skip re-rendering and only the streaming bubble updates.
React Component Using Streaming Hook
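// components/ChatInterface.tsx
import { useState, useRef, useEffect, type FormEvent } from 'react';
import { useStreamingChat } from '@/hooks/useStreamingChat';

export function ChatInterface() {
  const { messages, streamingContent, isLoading, error, sendMessage } = useStreamingChat();
  const [input, setInput] = useState('');
  const messagesEndRef = useRef<HTMLDivElement>(null);

  // Auto-scroll to bottom as new content streams
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  }, [messages, streamingContent]);

  const handleSubmit = (e: FormEvent) => {
    e.preventDefault();
    if (input.trim()) {
      sendMessage(input);
      setInput('');
    }
  };

  // Element structure and class names are illustrative
  return (
    <div className="chat">
      <div className="messages">
        {messages.map((msg, i) => (
          <div key={i} className={`message ${msg.role}`}>
            {msg.content}
          </div>
        ))}
        {/* Streaming message */}
        {streamingContent && (
          <div className="message assistant">
            {streamingContent}
            <span className="cursor">▋</span>
          </div>
        )}
        {/* Error display */}
        {error && (
          <div className="error">Error: {error}</div>
        )}
        <div ref={messagesEndRef} />
      </div>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          disabled={isLoading}
        />
        <button type="submit" disabled={isLoading}>Send</button>
      </form>
    </div>
  );
}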
The component displays completed messages and the currently streaming content separately. The streaming message includes a cursor indicator (▋) to show generation is active. Auto-scrolling ensures users see new content as it arrives without manual scrolling. The disabled input during loading prevents submitting multiple requests concurrently.
Handling SSE with EventSource
For backends using Server-Sent Events format, React can consume streams with the EventSource API. This provides automatic reconnection, event parsing, and error handling, but trades flexibility for convenience.
SSE Streaming Hook
// hooks/useSSEChat.ts
import { useState, useCallback } from 'react';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// createMessage is a hypothetical helper defined elsewhere: it POSTs the
// message and resolves with an ID that the GET stream endpoint understands

export function useSSEChat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [streamingContent, setStreamingContent] = useState('');
  const [isLoading, setIsLoading] = useState(false);

  const sendMessage = useCallback(async (content: string) => {
    setMessages(prev => [...prev, { role: 'user', content }]);
    setIsLoading(true);
    setStreamingContent('');

    // Note: EventSource only supports GET requests, so the message is
    // created first and the stream is then retrieved by ID
    const messageId = await createMessage(content);
    const eventSource = new EventSource(`/api/chat/stream/${messageId}`);

    // Accumulate locally: reading streamingContent state inside these
    // handlers would see the stale value captured when sendMessage was created
    let accumulated = '';

    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === 'chunk') {
        accumulated += data.text;
        setStreamingContent(accumulated);
      }
      if (data.type === 'done') {
        setMessages(prev => [...prev, { role: 'assistant', content: accumulated }]);
        setStreamingContent('');
        setIsLoading(false);
        eventSource.close();
      }
    };

    eventSource.onerror = (error) => {
      console.error('SSE error:', error);
      setIsLoading(false);
      // Closing here turns off EventSource's automatic reconnection
      eventSource.close();
    };
  }, []);

  return { messages, streamingContent, isLoading, sendMessage };
}
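EventSource limitations include GET-only requests and no custom headers (beyond cookies). This makes it unsuitable for REST APIs requiring POST with JSON bodies or Bearer token authentication. For these cases, use Fetch with ReadableStream. EventSource's advantage is automatic reconnection: if the connection drops, the browser reconnects on its own and reports the last event ID it received, so a server that tags events with IDs can resume from that point. Note that the hook above closes the connection in onerror, which turns this behavior off.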
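Resuming after a reconnect depends on the server: the browser sends the ID of the last event it received in a Last-Event-ID header, and the server has to replay from there. The sketch below shows one way to format events with IDs and compute what to replay. Both function names are hypothetical, and it assumes the server keeps the chunks generated so far (in memory or a cache).

```typescript
interface SSEEvent {
  data: string;
  id?: string;
  event?: string;
  retry?: number; // reconnection delay hint for the browser, in milliseconds
}

// Serializes one event in SSE wire format.
function formatSSE({ data, id, event, retry }: SSEEvent): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  if (retry !== undefined) lines.push(`retry: ${retry}`);
  // A payload containing newlines must be sent as several data: lines
  for (const line of data.split('\n')) lines.push(`data: ${line}`);
  return lines.join('\n') + '\n\n';
}

// Given every chunk generated so far and the Last-Event-ID value sent on
// reconnect, returns the formatted events the client has not seen yet.
// Event IDs are simply the chunk's index in the array.
function replayFrom(chunks: string[], lastEventId: string | undefined): string[] {
  const last = lastEventId === undefined ? -1 : Number.parseInt(lastEventId, 10);
  const start = Number.isNaN(last) ? 0 : last + 1;
  return chunks.slice(start).map((text, i) =>
    formatSSE({
      id: String(start + i),
      data: JSON.stringify({ type: 'chunk', text }),
    })
  );
}
```

In an Express handler, the header would be read with `req.get('Last-Event-ID')` and each string returned by replayFrom written with `res.write` before continuing with live chunks.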
Performance Optimization: Batched Updates
Updating React state on every chunk can cause performance issues. LLMs produce 50-100 tokens per second. If each token triggers a state update and re-render, you're rendering 50-100 times per second. This overwhelms React's reconciliation, causes frame drops, and degrades user experience.
Batching Strategy with RequestAnimationFrame
// hooks/useStreamingChatOptimized.ts
import { useState, useCallback, useRef } from 'react';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

export function useStreamingChatOptimized() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [streamingContent, setStreamingContent] = useState('');
  const [isLoading, setIsLoading] = useState(false);
  // Full text received so far; chunks accumulate here between renders
  const bufferRef = useRef('');
  const rafIdRef = useRef<number | null>(null);

  const sendMessage = useCallback(async (content: string) => {
    setMessages(prev => [...prev, { role: 'user', content }]);
    setIsLoading(true);
    setStreamingContent('');
    bufferRef.current = '';

    // Publish the accumulated text to state (at most once per frame)
    const flushBuffer = () => {
      setStreamingContent(bufferRef.current);
      rafIdRef.current = null;
    };

    try {
      const response = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ message: content })
      });
      if (!response.ok) throw new Error('Stream request failed');
      const reader = response.body?.getReader();
      const decoder = new TextDecoder();
      if (!reader) throw new Error('No reader');

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // Add to buffer instead of immediate state update
        bufferRef.current += decoder.decode(value, { stream: true });
        // Schedule flush if not already scheduled
        if (rafIdRef.current === null) {
          rafIdRef.current = requestAnimationFrame(flushBuffer);
        }
      }

      // Cancel any pending frame; the final text goes straight to messages
      if (rafIdRef.current !== null) {
        cancelAnimationFrame(rafIdRef.current);
        rafIdRef.current = null;
      }
      // Read from the ref, not from streamingContent state, which is stale in this closure
      const finalContent = bufferRef.current;
      setMessages(prev => [...prev, { role: 'assistant', content: finalContent }]);
      setStreamingContent('');
    } catch (error) {
      console.error('Streaming error:', error);
    } finally {
      setIsLoading(false);
    }
  }, []);

  return { messages, streamingContent, isLoading, sendMessage };
}
This optimization buffers chunks and updates state at most once per frame (about every 16.67ms on a 60 Hz display). Chunks arriving within a frame are batched into a single state update, so the render rate is capped at the display refresh rate no matter how quickly chunks arrive. The saving is largest on lower-end devices, where each render of a long, growing message is expensive.
Alternative: Fixed-Interval Batching
For even less aggressive updates, batch at fixed intervals (e.g., 100ms). This reduces renders to 10 per second, which is still fast enough that users perceive smooth streaming but much easier on React's reconciliation.
// Fixed-interval batching
// getStreamReader is a hypothetical helper that performs the fetch and
// returns response.body.getReader()
const sendMessageWithIntervalBatching = useCallback(async (content: string) => {
  let buffer = '';
  const decoder = new TextDecoder();

  // Move the buffered text into state and clear the buffer
  const flush = () => {
    if (!buffer) return;
    // Copy first: React runs the updater later, after buffer has been cleared
    const pending = buffer;
    buffer = '';
    setStreamingContent(prev => prev + pending);
  };

  // Flush buffer every 100ms
  const flushInterval = setInterval(flush, 100);
  try {
    const reader = await getStreamReader(content);
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
    }
  } finally {
    clearInterval(flushInterval);
    // Final flush
    flush();
  }
}, []);
Error Handling and Retry Logic
Streams can fail mid-transmission due to network interruptions, server errors, or rate limiting. Robust implementations detect failures and retry gracefully without losing context or forcing users to restart conversations.
Retry with Exponential Backoff
// hooks/useStreamingChatWithRetry.ts
import { useState, useCallback } from 'react';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

interface RetryOptions {
  maxRetries: number;
  baseDelay: number;
  maxDelay: number;
}

// streamMessage is a hypothetical helper defined elsewhere: it runs one
// streaming request, calls the callback for each decoded chunk, and rejects
// if the stream fails

export function useStreamingChatWithRetry(options: RetryOptions = {
  maxRetries: 3,
  baseDelay: 1000,
  maxDelay: 10000
}) {
  const { maxRetries, baseDelay, maxDelay } = options;
  const [messages, setMessages] = useState<Message[]>([]);
  const [streamingContent, setStreamingContent] = useState('');
  const [retryCount, setRetryCount] = useState(0);

  // Resolves with the full response text; each retry restarts from scratch
  const attemptStream = useCallback(async (content: string): Promise<string> => {
    for (let attempt = 0; ; attempt++) {
      let accumulated = '';
      // Discard partial text left over from a failed attempt
      setStreamingContent('');
      try {
        await streamMessage(content, (chunk) => {
          accumulated += chunk;
          setStreamingContent(accumulated);
        });
        // Success: reset retry count
        setRetryCount(0);
        return accumulated;
      } catch (error) {
        if (attempt >= maxRetries) {
          const reason = error instanceof Error ? error.message : String(error);
          throw new Error(`Failed after ${maxRetries} retries: ${reason}`);
        }
        const delay = Math.min(baseDelay * Math.pow(2, attempt), maxDelay);
        console.log(`Retrying in ${delay}ms (attempt ${attempt + 1}/${maxRetries})`);
        setRetryCount(attempt + 1);
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }, [maxRetries, baseDelay, maxDelay]);

  const sendMessage = useCallback(async (content: string) => {
    setMessages(prev => [...prev, { role: 'user', content }]);
    setRetryCount(0);
    try {
      // Use the returned text: streamingContent state is stale inside this closure
      const finalContent = await attemptStream(content);
      setMessages(prev => [...prev, { role: 'assistant', content: finalContent }]);
    } catch (error) {
      // Final failure after all retries
      const reason = error instanceof Error ? error.message : String(error);
      setMessages(prev => [...prev, { role: 'assistant', content: `Error: ${reason}` }]);
    } finally {
      setStreamingContent('');
    }
  }, [attemptStream]);

  return {
    messages,
    streamingContent,
    retryCount,
    sendMessage
  };
}
Exponential backoff prevents overwhelming servers during outages. First retry waits 1 second, second waits 2 seconds, third waits 4 seconds, capped at 10 seconds. This gives transient failures time to resolve while avoiding indefinite waits. After max retries, fail gracefully with an error message that lets users manually retry.
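The delay schedule described above can be checked with a small pure function. The optional jitter parameter is an addition not covered in the text: randomizing delays is a common way to keep many clients from retrying in lockstep, and it is shown here only as an assumption-labeled extension.

```typescript
interface BackoffOptions {
  baseDelay: number; // milliseconds
  maxDelay: number;  // milliseconds
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
// If a jitter function is supplied (e.g. Math.random), the capped delay is
// scaled by its result; without it the schedule is deterministic.
function backoffDelay(
  attempt: number,
  options: BackoffOptions,
  jitter?: () => number
): number {
  const capped = Math.min(options.baseDelay * Math.pow(2, attempt), options.maxDelay);
  return jitter ? Math.round(capped * jitter()) : capped;
}

const retryDefaults: BackoffOptions = { baseDelay: 1000, maxDelay: 10000 };
const schedule = [0, 1, 2, 3, 4].map(attempt => backoffDelay(attempt, retryDefaults));
console.log(schedule); // [1000, 2000, 4000, 8000, 10000]
```

With the default of three retries only the first three values are used; the cap comes into play once maxRetries is raised to five or more.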
Partial Response Recovery
When streams fail mid-response, you can preserve partial content instead of discarding it. This prevents losing valuable partial answers and gives users something to work with even when full generation fails.
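// Preserve partial content on stream failure
// Assumes Message has an optional `interrupted?: boolean` field; streamMessage
// and showError are hypothetical helpers defined elsewhere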
const sendMessageWithPartialRecovery = async (content: string) => {
setStreamingContent('');
let partialContent = '';
try {
await streamMessage(content, (chunk) => {
partialContent += chunk;
setStreamingContent(partialContent);
});
// Complete success
setMessages(prev => [...prev, {
role: 'assistant',
content: partialContent
}]);
} catch (error) {
// Preserve partial content with error indicator
if (partialContent.length > 50) {
setMessages(prev => [...prev, {
role: 'assistant',
content: partialContent + '\n\n[Response interrupted. Click retry to continue.]',
interrupted: true
}]);
} else {
// Too little content to be useful, show error
showError(error instanceof Error ? error.message : 'Unknown error');
}
} finally {
setStreamingContent('');
}
};
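The "Click retry to continue" action needs a request that picks up from the partial text. One possible approach, sketched below, strips the interruption note and appends a follow-up user message asking the model to continue. The function name and the wording of the continuation prompt are assumptions, and how well a model continues from a cut-off point varies.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
  interrupted?: boolean;
}

const INTERRUPTION_NOTE = '\n\n[Response interrupted. Click retry to continue.]';

// Builds the history to send when the user retries after an interruption.
function buildContinuationHistory(messages: ChatMessage[]): ChatMessage[] {
  // Remove the UI-only note so the model sees only its own partial text
  const history: ChatMessage[] = messages.map(({ role, content, interrupted }) => ({
    role,
    content:
      interrupted && content.endsWith(INTERRUPTION_NOTE)
        ? content.slice(0, -INTERRUPTION_NOTE.length)
        : content,
  }));
  const last = messages[messages.length - 1];
  if (last?.interrupted) {
    // Hypothetical continuation prompt; adjust the wording for your model
    history.push({
      role: 'user',
      content: 'Continue exactly where the previous response stopped.',
    });
  }
  return history;
}
```

The continuation arrives as a new assistant message; the UI can join it to the interrupted one when rendering.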
Integration with Popular AI SDKs
Major AI providers offer JavaScript SDKs with built-in streaming support. These SDKs handle protocol details and provide typed interfaces, simplifying integration compared to raw Fetch API usage.
Vercel AI SDK: Universal Streaming
The Vercel AI SDK provides a useChat hook that handles streaming for multiple AI providers (OpenAI, Anthropic, Hugging Face) with a unified interface.
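// Using Vercel AI SDK
// Import path as published in AI SDK 3.x; later releases expose useChat from '@ai-sdk/react'
import { useChat } from 'ai/react';

export function ChatComponent() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: '/api/chat',
    onError: (error) => {
      console.error('Chat error:', error);
    }
  });

  // Element structure is illustrative
  return (
    <div>
      {messages.map(msg => (
        <div key={msg.id}>{msg.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} disabled={isLoading} />
      </form>
    </div>
  );
}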
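// Backend API route (Next.js)
// OpenAIStream and StreamingTextResponse are helpers from AI SDK 3.x and earlier;
// later major versions replace them with streamText, so check the docs for your installed version
import { OpenAIStream, StreamingTextResponse } from 'ai';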
import OpenAI from 'openai';
export async function POST(req: Request) {
const { messages } = await req.json();
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY
});
const response = await openai.chat.completions.create({
model: 'gpt-4-turbo',
stream: true,
messages
});
const stream = OpenAIStream(response);
return new StreamingTextResponse(stream);
}
The Vercel AI SDK abstracts away streaming complexity. The useChat hook manages message history, streaming state, and error handling. Backend integration is provider-agnostic—the same frontend code works with OpenAI, Anthropic, or custom models by changing only the backend API implementation.
LangChain.js Streaming
LangChain provides streaming callbacks for complex chains and agents. This enables streaming not just LLM responses but intermediate steps in multi-stage workflows.
// LangChain streaming in React
import { ChatOpenAI } from '@langchain/openai';
async function streamWithLangChain(
message: string,
onToken: (token: string) => void
) {
const model = new ChatOpenAI({
modelName: 'gpt-4-turbo',
streaming: true,
callbacks: [{
handleLLMNewToken(token: string) {
onToken(token);
}
}]
});
await model.invoke([
{ role: 'user', content: message }
]);
}
// React hook consuming the token callback
import { useState } from 'react';
function useLangChainStreaming() {
const [content, setContent] = useState('');
const sendMessage = async (message: string) => {
setContent('');
await streamWithLangChain(message, (token) => {
setContent(prev => prev + token);
});
};
return { content, sendMessage };
}
Testing Streaming Implementations
Streaming code is harder to test than synchronous code. You need to mock streaming APIs, simulate network delays and failures, and verify UI updates at intermediate states. Proper testing prevents production issues that only appear under specific network conditions.
Mock Streaming API
// test/mocks/mockStreamingAPI.ts
export function createMockStream(text: string, chunkSize: number = 5, delay: number = 10) {
// The 's' flag lets '.' match newlines, so multi-line text is chunked without losing line breaks
const chunks = text.match(new RegExp(`.{1,${chunkSize}}`, 'gs')) || [];
const stream = new ReadableStream({
async start(controller) {
for (const chunk of chunks) {
await new Promise(resolve => setTimeout(resolve, delay));
controller.enqueue(new TextEncoder().encode(chunk));
}
controller.close();
}
});
return new Response(stream);
}
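// Test usage (Jest with React Testing Library)
import { renderHook, act, waitFor } from '@testing-library/react';
import { useStreamingChat } from '@/hooks/useStreamingChat';

test('should display streaming content incrementally', async () => {
  // 50ms between chunks leaves the intermediate state visible long enough to observe
  const mockResponse = createMockStream('Hello world from AI', 5, 50);
  global.fetch = jest.fn().mockResolvedValue(mockResponse);
  const { result } = renderHook(() => useStreamingChat());

  // Start the request without awaiting it so intermediate states stay observable
  let pending: Promise<void> = Promise.resolve();
  act(() => {
    pending = result.current.sendMessage('Test message');
  });

  // Wait for first chunk
  await waitFor(() => {
    expect(result.current.streamingContent).toContain('Hello');
  });

  // Wait for completion: the finished text moves from streamingContent to messages
  await act(async () => {
    await pending;
  });
  expect(result.current.messages).toHaveLength(2);
  expect(result.current.messages[1].content).toBe('Hello world from AI');
  expect(result.current.streamingContent).toBe('');
});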
Testing Error Recovery
// Test stream interruption and retry
test('should retry on stream failure', async () => {
let attemptCount = 0;
global.fetch = jest.fn().mockImplementation(() => {
attemptCount++;
if (attemptCount < 2) {
// First attempt fails mid-stream
return Promise.resolve({
ok: true,
body: new ReadableStream({
start(controller) {
controller.enqueue(new TextEncoder().encode('Partial'));
controller.error(new Error('Connection lost'));
}
})
});
}
// Second attempt succeeds
return createMockStream('Partial response complete');
});
const { result } = renderHook(() => useStreamingChatWithRetry({
maxRetries: 3,
baseDelay: 10,
maxDelay: 100 // required by RetryOptions; without it the computed delay is NaN
}));
await act(async () => {
result.current.sendMessage('Test');
});
await waitFor(() => {
expect(attemptCount).toBe(2);
expect(result.current.messages).toHaveLength(2); // User + complete assistant
});
});
Frequently Asked Questions
Should I use SSE or ReadableStream for streaming?
Use ReadableStream with Fetch for most cases—it's more flexible and works with POST requests and custom headers. Use SSE (EventSource) when you need automatic reconnection and can use GET requests. Modern browsers support both, but ReadableStream is better supported by edge runtimes and serverless platforms.
How do I handle authentication with streaming requests?
Include auth tokens in request headers with Fetch (works with ReadableStream). The native EventSource API sends cookies but cannot set custom headers, so it cannot send an Authorization: Bearer header. If using SSE with token auth, pass tokens as query parameters (less secure, since URLs tend to end up in server logs) or switch to ReadableStream with Fetch.
Can I cancel streaming requests in progress?
Yes, use AbortController with the Fetch API. Create an AbortController, pass its signal to fetch(), and call abort() to cancel. The pending fetch() or reader.read() call then rejects with an AbortError, which your code should catch and treat as a normal stop rather than a failure. EventSource has a close() method that terminates the connection.
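A minimal sketch of cancellation follows. The function name streamWithCancel and its return shape are hypothetical, and the fetch implementation is injectable only to make the function easy to test.

```typescript
type FetchLike = (input: string, init?: RequestInit) => Promise<Response>;

interface CancellableStream {
  done: Promise<string>; // resolves with the text received before completion or cancel
  cancel: () => void;
}

function streamWithCancel(
  url: string,
  body: unknown,
  onChunk: (text: string) => void,
  fetchImpl: FetchLike = fetch
): CancellableStream {
  const controller = new AbortController();

  const run = async (): Promise<string> => {
    let text = '';
    try {
      const response = await fetchImpl(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(body),
        signal: controller.signal,
      });
      if (!response.ok || !response.body) {
        throw new Error(`Request failed: ${response.status}`);
      }
      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        const chunk = decoder.decode(value, { stream: true });
        text += chunk;
        onChunk(chunk);
      }
    } catch (err) {
      // abort() makes fetch() and reader.read() reject; treat that as a normal stop
      if (!controller.signal.aborted) throw err;
    }
    return text;
  };

  return { done: run(), cancel: () => controller.abort() };
}
```

In a React hook, the cancel function would typically be stored in a ref so that a "Stop" button or an unmount cleanup can call it.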
How do I implement typing indicators during streaming?
Add a visual indicator (animated dots, cursor) to the streaming message component. The presence of streamingContent (non-empty string) indicates active generation. Remove the indicator when streaming completes and content moves to the messages array.
What about mobile performance with streaming?
Mobile devices have less processing power and more variable network conditions. Implement batching (update every 100-200ms, not per chunk) to reduce CPU usage. Use requestAnimationFrame for smooth UI updates. Test on actual mobile devices, not just browser devtools mobile simulation.
How do I save streaming conversations to database?
Save complete messages after streaming finishes, not during. Don't write to the database on every chunk. Wait until the full response is available, then save both user and assistant messages in a single transaction. This cuts database writes from one per chunk (often hundreds per response) to one per exchange.
Can I stream responses from self-hosted models?
Yes, but you need to implement streaming in your model serving layer. Tools like Ollama, vLLM, and TGI support streaming APIs. The frontend code is identical—streaming is a transport concern, not a model concern. The backend streaming implementation depends on your model serving infrastructure.
How do I handle rate limiting with streaming?
Implement rate limiting before starting the stream, not during. Check rate limits when the request arrives, reject with 429 status if exceeded. Don't start streaming and then cancel mid-stream—this wastes API costs and creates poor UX. Queue requests that would exceed limits and process when capacity becomes available.
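The check-before-streaming step might look like this fixed-window counter. It is a minimal in-memory sketch with a hypothetical name, and it assumes a single server process; deployments with several instances would need a shared store.

```typescript
// Returns a function that answers "may this client start a request now?"
function createRateLimiter(
  limit: number,
  windowMs: number,
  now: () => number = Date.now
): (clientId: string) => boolean {
  const windows = new Map<string, { start: number; count: number }>();

  return function allow(clientId: string): boolean {
    const t = now();
    const current = windows.get(clientId);
    // No window yet, or the previous window has expired: start a new one
    if (!current || t - current.start >= windowMs) {
      windows.set(clientId, { start: t, count: 1 });
      return true;
    }
    if (current.count >= limit) return false;
    current.count++;
    return true;
  };
}
```

In a route handler, call the returned function before setting any streaming headers and respond with status 429 when it returns false, so the stream is never opened for a request that would be rejected.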
What about accessibility with streaming content?
Use ARIA live regions to announce streaming content to screen readers. Update aria-live="polite" regions with new content, but batch updates to avoid overwhelming users with constant announcements. Provide a "stop generation" button for users who want to terminate streaming before completion.
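One way to batch announcements is to hand the live region only complete sentences. The helper below is a sketch with a hypothetical name; it uses a simple punctuation rule to detect sentence ends, which will misjudge some abbreviations.

```typescript
interface Announcement {
  announce: string;        // text to place in the aria-live region ('' if nothing new)
  announcedLength: number; // offset into fullText that has now been announced
}

// Returns the complete sentences that appeared since the last announcement.
function nextAnnouncement(fullText: string, announcedLength: number): Announcement {
  const unseen = fullText.slice(announcedLength);
  // Greedy match up to the last '.', '!' or '?' that is followed by whitespace or end of text
  const match = /^[\s\S]*[.!?](?=\s|$)/.exec(unseen);
  if (!match) return { announce: '', announcedLength };
  return {
    announce: match[0].trim(),
    announcedLength: announcedLength + match[0].length,
  };
}
```

Render the returned text inside a visually hidden element with aria-live="polite", and call the helper each time streamingContent changes, passing back the previous announcedLength.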
How do I debug streaming issues in production?
Log stream lifecycle events: connection opened, first chunk received, chunks per second, total duration, errors. Track latency to first token separately from total latency. Monitor error rates by error type (network, timeout, API error). Use distributed tracing to correlate frontend streaming with backend generation.
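The lifecycle metrics listed above can be collected with a small tracker like the following sketch. The name createStreamTracker and the shape of the metrics object are assumptions; the clock is injectable so the logic can be tested without real delays.

```typescript
interface StreamMetrics {
  timeToFirstChunkMs: number | null; // null if no chunk ever arrived
  totalDurationMs: number;
  chunkCount: number;
  chunksPerSecond: number;
  error: string | null;
}

function createStreamTracker(now: () => number = Date.now) {
  const startedAt = now();
  let firstChunkAt: number | null = null;
  let chunkCount = 0;

  return {
    // Call once per received chunk
    onChunk(): void {
      if (firstChunkAt === null) firstChunkAt = now();
      chunkCount++;
    },
    // Call when the stream ends, passing the error if it failed
    finish(error?: unknown): StreamMetrics {
      const totalDurationMs = now() - startedAt;
      return {
        timeToFirstChunkMs: firstChunkAt === null ? null : firstChunkAt - startedAt,
        totalDurationMs,
        chunkCount,
        chunksPerSecond: totalDurationMs > 0 ? chunkCount / (totalDurationMs / 1000) : 0,
        error:
          error === undefined
            ? null
            : error instanceof Error
              ? error.message
              : String(error),
      };
    },
  };
}
```

Create a tracker just before calling fetch, call onChunk inside the read loop, and send the result of finish to your logging or analytics endpoint from the finally block.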
Conclusion
Streaming AI responses transforms user experience from waiting for complete responses to engaging with content as it generates. Implement streaming for responses longer than 100 tokens using ReadableStream with Fetch for maximum flexibility. Optimize performance with batched updates using requestAnimationFrame or fixed intervals. Handle errors gracefully with retry logic and partial content preservation. Use established patterns like Vercel AI SDK's useChat hook to avoid reinventing streaming infrastructure.
Start with simple per-chunk updates and measure performance before optimizing. Add batching only if profiling shows render performance issues. Focus on error handling and retry logic—streaming's biggest challenges aren't technical complexity but graceful degradation when networks fail. The patterns in this article provide production-ready foundations that scale from prototype to high-traffic applications.