Long-running AI chat conversations in Next.js apps built with the Vercel AI SDK will eventually hit the LLM's context window token limit, causing requests to silently fail. Conversation compaction solves this by summarizing older messages and only sending recent ones to the model, along with the summary. This keeps the conversation going indefinitely without losing important context.
In this article, we'll walk through how to implement automatic conversation compaction in a Next.js app using the Vercel AI SDK. We'll start with a basic chat setup, detect when the context window limit is approaching, and then build the full compaction mechanism.
What you'll learn:
- Setting up a basic chat with useChat and ToolLoopAgent
- Detecting token limit errors from the LLM API response
- Tracking token usage proactively with messageMetadata
- Building a compaction agent that summarizes conversations
- Creating a React hook that triggers compaction automatically
- Wiring compaction into the useChat retry flow with regenerate()
Prerequisites:
- Next.js 14+ with App Router
- Vercel AI SDK v6 (ai and @ai-sdk/react)
- An LLM API key (Anthropic, OpenAI, etc.)
- Basic familiarity with React hooks and streaming
Setting Up a Next.js Chat App with Vercel AI SDK
Before diving into compaction, let's establish the foundation. A typical Vercel AI SDK chat app has three pieces: a useChat hook on the client, a streaming API route, and an agent configuration.
The Agent
The agent is a ToolLoopAgent that wraps your LLM model and system prompt:
import { ToolLoopAgent } from 'ai';
export const chatAgent = async () =>
new ToolLoopAgent({
model: 'anthropic/claude-sonnet-4-5',
instructions: 'You are a helpful assistant.',
tools: {},
});
The Next.js API Route
The API route receives messages from the client, converts them to model messages, and streams the response back:
// app/api/chat/route.ts
import { convertToModelMessages } from 'ai';
import { chatAgent } from '@/agent';
export async function POST(request: Request) {
const { messages } = await request.json();
const agent = await chatAgent();
const response = await agent.stream({
messages: await convertToModelMessages(messages),
});
return response.toUIMessageStreamResponse({
originalMessages: messages,
});
}
The Client Hook
On the frontend, useChat handles the streaming connection:
// hooks/useChatStream.ts
import { useChat, DefaultChatTransport } from '@ai-sdk/react';
export function useChatStream() {
const { messages, sendMessage, status } = useChat({
transport: new DefaultChatTransport({ api: '/api/chat' }),
});
return { messages, sendMessage, isStreaming: status === 'streaming' };
}
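To see the hook in action, here's one way a page component could consume it. The file path, markup, and text-part rendering below are illustrative, not prescriptive:
// app/chat/page.tsx (illustrative wiring for the hook above)
'use client';
import { useState } from 'react';
import { useChatStream } from '@/hooks/useChatStream';
export default function ChatPage() {
  const { messages, sendMessage, isStreaming } = useChatStream();
  const [input, setInput] = useState('');
  return (
    <div>
      {messages.map((message) => (
        <div key={message.id}>
          <strong>{message.role}:</strong>{' '}
          {message.parts.map((part, index) =>
            // UIMessage content lives in parts; render only the text parts here
            part.type === 'text' ? <span key={index}>{part.text}</span> : null,
          )}
        </div>
      ))}
      <form
        onSubmit={(event) => {
          event.preventDefault();
          if (!input.trim()) return;
          sendMessage({ text: input });
          setInput('');
        }}
      >
        <input value={input} onChange={(event) => setInput(event.target.value)} />
        <button type="submit" disabled={isStreaming}>Send</button>
      </form>
    </div>
  );
}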
This works great — until the conversation gets long enough to exceed the LLM's context window token limit.
Detecting Context Window Token Limits
There are two ways to detect that a conversation is approaching (or has exceeded) the context window token limit: reactively by catching API errors, and proactively by tracking token usage.
Catching Token Limit Errors
When the LLM rejects a request because the prompt is too long, the error surfaces through the onError callback on your toUIMessageStreamResponse. We can detect this in our Next.js API route by inspecting the error's response body:
// app/api/chat/route.ts
return response.toUIMessageStreamResponse({
originalMessages: messages,
onError: (error) => {
const responseBody: string | undefined = (
error as { cause?: { responseBody?: string } }
)?.cause?.responseBody;
try {
if (responseBody) {
const errorText = JSON.stringify(JSON.parse(responseBody)).toLowerCase();
if (
errorText.includes('prompt is too long') ||
errorText.includes('input is too long') ||
errorText.includes('maximum context length')
) {
return 'TOKEN_LIMIT_EXCEEDED';
}
}
} catch {
return 'GENERIC_ERROR';
}
return 'GENERIC_ERROR';
},
});
The error code string is what the client receives as error.message in the useChat hook.
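To make that concrete, here's a preview of the client-side check. It's the same useChatStream hook from earlier; the console.warn is just a placeholder, and the full version that actually triggers compaction comes later in the article:
// hooks/useChatStream.ts (preview; full compaction wiring comes later)
import { useChat, DefaultChatTransport } from '@ai-sdk/react';
export function useChatStream() {
  const { messages, sendMessage, status, error } = useChat({
    transport: new DefaultChatTransport({ api: '/api/chat' }),
    onError: (err) => {
      // The string returned from the server-side onError arrives here as err.message
      if (err.message === 'TOKEN_LIMIT_EXCEEDED') {
        console.warn('Context window exceeded'); // placeholder: we'll trigger compaction here instead
      }
    },
  });
  return { messages, sendMessage, error, isStreaming: status === 'streaming' };
}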
Tracking Token Usage Proactively
A better approach is to track how much of the context window we're using before hitting the token limit. The Vercel AI SDK exposes token usage through the messageMetadata callback. We can capture the inputTokens from each LLM response:
// app/api/chat/route.ts
function createMetadataHandler() {
let lastStepUsage: { inputTokens?: number; outputTokens?: number };
return ({ part }) => {
if (part.type === 'finish-step') {
lastStepUsage = part.usage;
return {};
}
if (part.type === 'finish') {
return { totalUsage: part.totalUsage, lastStepUsage };
}
return {};
};
}
return response.toUIMessageStreamResponse({
originalMessages: messages,
messageMetadata: createMetadataHandler(),
});
On the client, we can then compute the context window usage percentage by reading the metadata from the last assistant message:
// hooks/useTokenUsage.ts
const MODEL_CONTEXT_WINDOW = 200_000; // Claude Sonnet's context window
const WARNING_THRESHOLD = 0.75;
export function useTokenUsage(messages) {
for (let i = messages.length - 1; i >= 0; i--) {
const message = messages[i];
if (message.role === 'assistant' && message.metadata?.lastStepUsage) {
const inputTokens = message.metadata.lastStepUsage.inputTokens;
if (inputTokens !== undefined) {
return {
usagePercentage: inputTokens / MODEL_CONTEXT_WINDOW,
usageLevel: inputTokens / MODEL_CONTEXT_WINDOW >= WARNING_THRESHOLD ? 'warning' : 'normal',
};
}
}
}
return { usagePercentage: undefined, usageLevel: 'normal' };
}
This gives you the data to show a "75% context used" indicator in your UI, alerting the user before the token limit breaks things.
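A minimal indicator component might look like this. The component name, copy, and inline styling are assumptions; adapt them to your design system:
// components/ContextUsageIndicator.tsx (illustrative)
import type { UIMessage } from 'ai';
import { useTokenUsage } from '@/hooks/useTokenUsage';
export function ContextUsageIndicator({ messages }: { messages: UIMessage[] }) {
  const { usagePercentage, usageLevel } = useTokenUsage(messages);
  // Nothing to show until at least one assistant response carries usage metadata
  if (usagePercentage === undefined) return null;
  const percent = Math.min(100, Math.round(usagePercentage * 100));
  return (
    <div role="status" style={{ color: usageLevel === 'warning' ? 'darkorange' : 'inherit' }}>
      {percent}% of the context window used
      {usageLevel === 'warning' && ' (compaction will kick in soon)'}
    </div>
  );
}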
Why Not Just Truncate Old Messages?
Before we build the compaction solution, it's worth addressing the obvious alternative.
Simple truncation — dropping the oldest N messages — loses context. The user might have specified important constraints early in the conversation that the LLM needs to remember. Compaction preserves the meaning while reducing the tokens.
Using a model with a larger context window helps, but it's not a complete solution. Larger context windows are more expensive per request and slower. Even with a 1M token context window, a production app with thousands of concurrent users will benefit from sending fewer tokens per request. Eventually, any context window can be exceeded.
Conversation compaction gives you the best of both worlds: the LLM retains full awareness of the conversation history without paying for all the tokens.
Building a Conversation Summarization Agent
Compaction works by summarizing the conversation into a short paragraph, then prepending that summary to only the most recent messages. We need a dedicated agent for this — a lightweight one that just summarizes:
// agent/compaction.agent.ts
import { ToolLoopAgent } from 'ai';
const COMPACTION_SYSTEM_PROMPT = `You are a conversation compaction agent. Your job is to take a long conversation history and produce a concise summary that preserves all essential context.
<rules>
- Summarize all messages into a brief context paragraph
- Remove redundant information: verbose explanations, intermediate steps, and repeated instructions
- Preserve: user requirements, key decisions, constraints, and unresolved questions
- The summary should allow the AI to continue the conversation seamlessly
</rules>
<important>
- Keep the summary under 500 words
- Focus on WHAT was discussed and WHY, not HOW
- Return ONLY the summary text, nothing else
</important>`;
export const compactionAgent = () =>
new ToolLoopAgent({
model: 'anthropic/claude-sonnet-4-5',
instructions: COMPACTION_SYSTEM_PROMPT,
tools: {},
});
The example reuses claude-sonnet-4-5, but compaction is a straightforward summarization task, so consider pointing this agent at a smaller, cheaper model: there's no need to burn tokens on your most capable LLM.
Creating the Compaction API Route in Next.js
The compaction endpoint receives the conversation messages and asks the compaction agent to summarize them. Stripping out heavy content (like generated code blocks or tool call results) before summarizing is also worthwhile; a sketch of that step follows the route below:
// app/api/compact/route.ts
import { convertToModelMessages } from 'ai';
import { compactionAgent } from '@/agent/compaction.agent';
export async function POST(request: Request) {
const { messages, previousSummary } = await request.json();
const previousSummaryBlock = previousSummary
? `<previous_compaction_summary>\n${previousSummary}\n</previous_compaction_summary>\n\n`
: '';
const agent = compactionAgent();
const response = await agent.generate({
messages: [
...(await convertToModelMessages(messages)),
{
role: 'user',
content: `${previousSummaryBlock}Compact this conversation.`,
},
],
});
return Response.json({ summary: response.text });
}
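The route above passes the messages through unchanged. If your conversations carry heavy parts (tool call results, file attachments, large code blocks), you can filter them before conversion. The helper below is a sketch: the stripHeavyParts name, the part-type check, and the truncation limit are assumptions, so adjust them to the part types your app actually produces.
// app/api/compact/strip.ts (illustrative helper, not part of the AI SDK)
import type { UIMessage } from 'ai';
const MAX_TEXT_PART_LENGTH = 4_000; // assumption: cap very long text parts
export function stripHeavyParts(messages: UIMessage[]): UIMessage[] {
  return messages.map((message) => ({
    ...message,
    parts: message.parts
      // Keep only plain text parts; drop tool calls, files, and other heavy parts
      .filter((part) => part.type === 'text')
      // Truncate very long text so the compaction request itself stays cheap
      .map((part) =>
        part.type === 'text' && part.text.length > MAX_TEXT_PART_LENGTH
          ? { ...part, text: part.text.slice(0, MAX_TEXT_PART_LENGTH) }
          : part,
      ),
  }));
}
You'd call stripHeavyParts(messages) in the compaction route before convertToModelMessages.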
A key detail: we pass any previousSummary to the agent so it can build on prior compactions rather than losing earlier context. This means compaction can happen multiple times throughout a conversation's lifetime as the context window fills up again.
Building a React Hook for Chat Compaction
The client-side hook manages the compaction state and triggers the compaction API call. It tracks the summary and the index of the last message that was compacted:
// hooks/useChatCompaction.ts
import { useState, useEffect } from 'react';
export function useChatCompaction(chatId: string) {
const [compactionState, setCompactionState] = useState({
summary: null,
compactedUpToIndex: 0,
});
const [compactionVersion, setCompactionVersion] = useState(0);
const [isCompacting, setIsCompacting] = useState(false);
// Reset when switching chats
useEffect(() => {
setCompactionState({ summary: null, compactedUpToIndex: 0 });
setCompactionVersion(0);
setIsCompacting(false);
}, [chatId]);
const compactConversation = async (messages, previousSummary, previousCompactedUpToIndex) => {
setIsCompacting(true);
try {
const messagesToCompact = messages.slice(previousCompactedUpToIndex ?? 0);
const response = await fetch('/api/compact', {
method: 'POST',
body: JSON.stringify({
messages: messagesToCompact,
previousSummary: previousSummary ?? undefined,
}),
});
if (!response.ok) throw new Error('Compaction failed');
const { summary } = await response.json();
setCompactionState({
summary,
compactedUpToIndex: messages.length - 1,
});
setCompactionVersion((v) => v + 1);
} finally {
setIsCompacting(false);
}
};
return {
compactConversation,
compactionState,
compactionVersion,
isCompacting,
};
}
Integrating Compaction into useChat with Auto-Retry
The final piece is integrating compaction into the main useChat hook. The flow is:
- User sends a message and gets a TOKEN_LIMIT_EXCEEDED error
- Instead of showing the error immediately, we trigger compaction
- Once compaction finishes, we retry the request with the summary
- The Next.js API route prepends the summary as a system message and only sends recent messages
Here's the updated chat hook:
// hooks/useChatStream.ts
import { useChat, DefaultChatTransport } from '@ai-sdk/react';
import { useCallback, useEffect, useRef } from 'react';
import { useChatCompaction } from './useChatCompaction';
const RECENT_MESSAGES_BUFFER = 6; // keep in sync with the slice window in app/api/chat/route.ts
export function useChatStream() {
const chatId = 'current-chat-id'; // however you resolve this
const hasCompactedRef = useRef(false);
const {
compactConversation,
compactionState,
compactionVersion,
isCompacting,
} = useChatCompaction(chatId);
const {
messages,
sendMessage,
status,
error,
clearError,
regenerate,
} = useChat({
id: chatId,
transport: new DefaultChatTransport({
api: '/api/chat',
body: {
summary: compactionState.summary,
compactedUpToIndex: compactionState.compactedUpToIndex,
},
}),
onError: async (error) => {
if (error.message === 'TOKEN_LIMIT_EXCEEDED' && !hasCompactedRef.current) {
hasCompactedRef.current = true;
try {
await compactConversation(
messages,
compactionState.summary,
compactionState.compactedUpToIndex,
);
clearError();
} catch {
hasCompactedRef.current = false;
}
}
},
});
// When compaction completes, regenerate with the summary
useEffect(() => {
if (!compactionState.summary || compactionVersion === 0) return;
regenerate({ body: compactionState });
hasCompactedRef.current = false;
}, [compactionVersion, regenerate, compactionState]);
// Suppress error while compaction is in-flight
const exposedError =
error?.message === 'TOKEN_LIMIT_EXCEEDED' && isCompacting ? undefined : error;
return {
messages,
isStreaming: status === 'streaming',
isCompacting,
sendMessage,
error: exposedError,
clearError,
};
}
And the updated Next.js API route that handles the summary and message slicing:
// app/api/chat/route.ts
import { convertToModelMessages } from 'ai';
import { chatAgent } from '@/agent';
export async function POST(request: Request) {
const { messages, summary, compactedUpToIndex } = await request.json();
// Only send recent messages if we have a compaction
const recentMessages = compactedUpToIndex !== undefined
? messages.slice(Math.max(0, compactedUpToIndex - 6))
: messages;
const modelMessages = await convertToModelMessages(recentMessages);
// Prepend the summary as context
if (summary) {
modelMessages.unshift({
role: 'system',
content: `<previous_conversation_summary>${summary}</previous_conversation_summary>`,
});
}
const agent = await chatAgent();
const response = await agent.stream({ messages: modelMessages });
return response.toUIMessageStreamResponse({
originalMessages: messages,
messageMetadata: createMetadataHandler(),
onError: (error) => { /* error detection from earlier */ },
});
}
How Automatic Conversation Compaction Works End-to-End
Let's walk through the full cycle:
- The user has a long conversation — token usage climbs to 90%, 95% of the context window...
- The next message fails with a "prompt too long" error from the LLM
- The onError callback fires, triggering compactConversation()
- The compaction agent summarizes the conversation into a ~500-word paragraph
- compactionVersion increments, triggering the useEffect
- regenerate() retries the last request, this time sending the summary + only the last 6 messages
- The conversation continues seamlessly — the user doesn't even notice
If the conversation continues to grow and hits the context window token limit again, the process repeats. Each compaction builds on the previous summary, so context accumulates without the token count growing unboundedly.
Conclusion
Conversation compaction solves a real problem that every production AI chat app built with Next.js and the Vercel AI SDK will eventually face. The SDK provides the primitives we need — useChat with onError, regenerate(), messageMetadata for token tracking, and convertToModelMessages for the backend.
The key insight is treating compaction as a transparent retry mechanism: catch the token limit error, summarize, slice the messages, and retry — all before the user even sees a failure. Combined with proactive context window usage tracking, you can surface indicators in the UI so users understand why compaction happened.
Next steps to consider:
- Persist the compaction summary to a database so returning users resume with context intact
- Trigger compaction proactively at 80% context window usage instead of waiting for a failure (see the sketch after this list)
- Add a UI indicator showing "X% context used" with a progress bar
- Strip heavy content (code blocks, tool call results) before sending to the compaction agent to reduce cost
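For the proactive trigger mentioned above, one option is to watch the usage level from useTokenUsage and kick off compaction before the model rejects a request. Here's a sketch; the hook name, threshold, and wiring are assumptions built on the hooks from earlier:
// hooks/useProactiveCompaction.ts (sketch)
import { useEffect, useRef } from 'react';
const PROACTIVE_THRESHOLD = 0.8; // assumption: compact at 80% of the context window
export function useProactiveCompaction({
  messages,
  usagePercentage, // from useTokenUsage
  isStreaming, // from useChatStream
  compactConversation, // from useChatCompaction
  compactionState,
}) {
  const inFlightRef = useRef(false);
  useEffect(() => {
    // Don't compact mid-stream or while another compaction is running
    if (isStreaming || inFlightRef.current) return;
    if (usagePercentage === undefined || usagePercentage < PROACTIVE_THRESHOLD) return;
    inFlightRef.current = true;
    compactConversation(messages, compactionState.summary, compactionState.compactedUpToIndex)
      .finally(() => {
        inFlightRef.current = false;
      });
  }, [usagePercentage, isStreaming, messages, compactConversation, compactionState]);
}
Because the summary is already included in the transport body, subsequent requests should pick it up without going through the regenerate() retry path.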
