Next.js • Edge • LLM
Streaming LLM responses from the edge
The AI assistant on this site starts replying before the model has finished thinking. That is not a trick — it is the difference between buffering a full completion and streaming it token by token. For a conversational UI, that difference is the entire perceived performance.
Here is the architecture I reach for when an LLM feature has to feel fast, stay cheap, and never hard-fail in front of a user.
Why streaming matters
A 400-token answer at a typical generation speed takes a few seconds to produce in full. If you wait for the whole thing, the user stares at a spinner for those seconds. If you stream, the first words appear in a few hundred milliseconds and the rest arrives as fast as it is generated. Same total time, completely different feel — and users read along while it writes, so the wait disappears.
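To put rough numbers on that, here is a back-of-the-envelope sketch. The generation speed (100 tokens per second) and the time to first token (300 ms) are assumed figures for illustration, not measurements from this site.

```typescript
// Assumed figures, for illustration only.
const TOKENS = 400;
const TOKENS_PER_SECOND = 100; // assumption: provider generation speed
const FIRST_TOKEN_MS = 300; // assumption: latency before the first token

// Time until the user sees any text at all, in milliseconds.
function timeToFirstText(mode: "buffered" | "streamed"): number {
  const generationMs = (TOKENS / TOKENS_PER_SECOND) * 1000;
  // Buffered: nothing renders until the last token exists.
  // Streamed: the first token renders as soon as it arrives.
  return mode === "buffered" ? FIRST_TOKEN_MS + generationMs : FIRST_TOKEN_MS;
}

console.log(timeToFirstText("buffered")); // 4300
console.log(timeToFirstText("streamed")); // 300
```

Under these assumptions the full answer lands at about the same moment either way; only the wait before the first visible word changes.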
Run it on the edge
Streaming is a network game, so latency to the user matters more than CPU. The edge runtime puts the route close to the visitor and gives you Web-standard streaming primitives out of the box. In the App Router it is a one-line opt-in on the route handler:
export const runtime = "edge";
export async function POST(request: Request) {
  // ...validate, rate-limit, then stream
}
No Node APIs, no cold-start penalty worth worrying about, and the same Response/ReadableStream you would use in any modern runtime.
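The "validate" step in that handler could look roughly like the sketch below. The `ChatMessage` shape, the `parseMessages` helper, and both limits are hypothetical names and values chosen for illustration; the real route may accept a different payload. Rate limiting is not covered here.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

const MAX_MESSAGES = 20; // assumption: cap on conversation length
const MAX_CHARS = 2000; // assumption: cap on each message

function isRole(value: unknown): value is ChatMessage["role"] {
  return value === "user" || value === "assistant";
}

// Returns the validated messages, or null if the body is malformed.
function parseMessages(body: unknown): ChatMessage[] | null {
  if (typeof body !== "object" || body === null) return null;
  const messages = (body as { messages?: unknown }).messages;
  if (!Array.isArray(messages) || messages.length === 0) return null;
  if (messages.length > MAX_MESSAGES) return null;

  const valid: ChatMessage[] = [];
  for (const item of messages) {
    if (typeof item !== "object" || item === null) return null;
    const { role, content } = item as { role?: unknown; content?: unknown };
    if (!isRole(role)) return null;
    if (typeof content !== "string" || content.length === 0) return null;
    if (content.length > MAX_CHARS) return null;
    valid.push({ role, content });
  }
  return valid;
}
```

Inside the handler, one way to use it is to call `parseMessages(await request.json().catch(() => null))` and return a `400` response when the result is `null`, so that malformed input never reaches the provider.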
Stream with a ReadableStream
The provider SDK hands you chunks as they arrive. You forward each one into a ReadableStream and return it as plain text — the browser receives bytes the moment they are enqueued:
const encoder = new TextEncoder();
const stream = new ReadableStream({
  async start(controller) {
    try {
      await streamCompletion(systemPrompt, messages, (chunk) => {
        controller.enqueue(encoder.encode(chunk));
      });
      controller.close();
    } catch (err) {
      controller.error(err);
    }
  },
});
return new Response(stream, {
  headers: {
    "Content-Type": "text/plain; charset=utf-8",
    "Cache-Control": "no-store",
  },
});
On the client you read the response body with a stream reader and append each chunk to the message as it lands. No polling, no websockets, no extra dependency.
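A minimal sketch of that client-side reader, using only the standard `Response.body.getReader()` and `TextDecoder` APIs. The `readTextStream` name and the `onChunk` callback are hypothetical; how each chunk gets appended to the message depends on your UI framework.

```typescript
// Reads a streamed text response and calls onChunk with each decoded piece.
// Resolves with the full text once the stream ends.
async function readTextStream(
  response: Response,
  onChunk: (text: string) => void,
): Promise<string> {
  if (!response.body) throw new Error("Response has no body to stream");

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let full = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact when a character
    // is split across two network chunks.
    const text = decoder.decode(value, { stream: true });
    full += text;
    if (text) onChunk(text);
  }

  // Flush any bytes the decoder was still holding.
  const tail = decoder.decode();
  if (tail) {
    full += tail;
    onChunk(tail);
  }
  return full;
}
```

In a component you would `fetch` the route with a POST body and pass the resulting response to `readTextStream`, updating state inside `onChunk`. The `stream: true` option matters for non-ASCII text, because chunk boundaries fall on bytes, not characters.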
Always have a fallback
An LLM call can be rate-limited, time out, or simply have no API key configured in a given environment. None of that should show the user an error. The route on this site degrades to a rule-based responder whenever the live path is unavailable, and signals the mode in a response header so the UI can label it honestly:
if (!resolveProvider()) {
  return Response.json(
    { mode: "fallback" },
    { headers: { "X-Terminal-Mode": "fallback" } },
  );
}
The feature is therefore never down; it just gets quieter when the model is unreachable. That property is worth more than any single clever prompt.
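For a sense of what a rule-based responder can look like, here is a deliberately small sketch. The rules, the replies, and the helper names (`ruleBasedReply`, `fallbackResponse`, `isFallback`) are all hypothetical, and unlike the snippet above this version returns plain text so the client could reuse its streaming read path. Only the `X-Terminal-Mode` header comes from the route described here.

```typescript
// Hypothetical rules: the first matching pattern wins.
const RULES: Array<{ pattern: RegExp; reply: string }> = [
  { pattern: /\b(hi|hello|hey)\b/i, reply: "Hello! The live model is offline, so my answers are limited." },
  { pattern: /\b(contact|email)\b/i, reply: "You can reach me through the contact page." },
];
const DEFAULT_REPLY = "I can't reach the model right now. Please try again later.";

function ruleBasedReply(input: string): string {
  const rule = RULES.find((r) => r.pattern.test(input));
  return rule ? rule.reply : DEFAULT_REPLY;
}

// Server side: answer from the rules and flag the mode in a header.
function fallbackResponse(input: string): Response {
  return new Response(ruleBasedReply(input), {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "no-store",
      "X-Terminal-Mode": "fallback",
    },
  });
}

// Client side: check the header so the UI can label the reply honestly.
function isFallback(response: Response): boolean {
  return response.headers.get("X-Terminal-Mode") === "fallback";
}
```

The same `fallbackResponse` could also be returned from a `catch` around the provider call, as long as streaming has not started yet; once bytes are on the wire, the status and headers can no longer change.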
What this buys you
- First token in a few hundred milliseconds instead of a multi-second spinner.
- Edge deployment, so latency tracks the user's location, not your data centre's.
- A graceful fallback that keeps the feature alive without an API key.
- Zero extra client dependencies — Web streams all the way down.
If you are wiring an LLM into a product flow and want it to feel native rather than bolted on, this is the shape I would start from.