2 min read

Next.js • Edge • LLM

Streaming LLM responses from the edge

The AI assistant on this site starts replying before the model has finished thinking. That is not a trick — it is the difference between buffering a full completion and streaming it token by token. For a conversational UI, that difference is the entire perceived performance.

Here is the architecture I reach for when an LLM feature has to feel fast, stay cheap, and never hard-fail in front of a user.

Why streaming matters

A 400-token answer at a typical generation speed takes a few seconds to produce in full. If you wait for the whole thing, the user stares at a spinner for those seconds. If you stream, the first words appear in a few hundred milliseconds and the rest arrives as fast as it is generated. Same total time, completely different feel — and users read along while it writes, so the wait disappears.
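To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The figures (80 tokens per second, 300 ms to the first token) are illustrative assumptions, not measurements from this site, and `GenerationProfile` and `timeToFirstText` are names made up for the example.

```typescript
// Illustrative assumptions only; plug in your own provider's numbers.
interface GenerationProfile {
  tokens: number;            // length of the answer in tokens
  tokensPerSecond: number;   // assumed generation speed
  firstTokenSeconds: number; // assumed delay before the first token exists
}

// Seconds until the user sees any text at all.
function timeToFirstText(p: GenerationProfile, streamed: boolean): number {
  const fullAnswer = p.firstTokenSeconds + p.tokens / p.tokensPerSecond;
  // Streaming shows text as soon as the first token exists;
  // buffering shows nothing until the whole answer is done.
  return streamed ? p.firstTokenSeconds : fullAnswer;
}

const profile: GenerationProfile = {
  tokens: 400,
  tokensPerSecond: 80,
  firstTokenSeconds: 0.3,
};

const buffered = timeToFirstText(profile, false); // 0.3 + 400 / 80 seconds
const streamed = timeToFirstText(profile, true);  // 0.3 seconds
```

Under these assumed numbers the total generation time is identical in both cases; only the moment the first text becomes visible changes.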

Run it on the edge

Streaming is a network game, so latency to the user matters more than CPU. The edge runtime puts the route close to the visitor and gives you Web-standard streaming primitives out of the box. In the App Router it is a one-line opt-in on the route handler:

export const runtime = "edge";

export async function POST(request: Request) {
  // ...validate, rate-limit, then stream
}

The trade-off is that Node-specific APIs such as fs are not available. In return there is no cold-start penalty worth worrying about, and you get the same Response and ReadableStream you would use in any modern runtime.

Stream with a ReadableStream

The provider SDK hands you chunks as they arrive. You forward each one into a ReadableStream and return it as plain text, so the browser can start receiving bytes as soon as they are enqueued, provided nothing between the route and the client buffers the response:

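// Response bodies are bytes, so each text chunk is encoded as UTF-8.
const encoder = new TextEncoder();

const stream = new ReadableStream({
  async start(controller) {
    try {
      // streamCompletion is this site's own helper around the provider SDK;
      // it calls the callback once per chunk of generated text.
      await streamCompletion(systemPrompt, messages, (chunk) => {
        controller.enqueue(encoder.encode(chunk));
      });
      // Generation finished: end the response body cleanly.
      controller.close();
    } catch (err) {
      // Surface provider failures to the reader instead of hanging.
      controller.error(err);
    }
  },
});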

return new Response(stream, {
  headers: {
    "Content-Type": "text/plain; charset=utf-8",
    "Cache-Control": "no-store",
  },
});

On the client you read the response body with a stream reader and append each chunk to the message as it lands. No polling, no websockets, no extra dependency.
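The client side of that can be sketched as follows. This is a minimal version, not the code running on this site: `readTextStream` is a hypothetical helper name, and the endpoint and callback in the usage comment are placeholders.

```typescript
// Reads a streamed text body and calls onChunk with each decoded piece.
// Resolves with the full text once the stream ends.
async function readTextStream(
  body: ReadableStream<Uint8Array>,
  onChunk: (text: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder("utf-8");
  let full = "";
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps a multi-byte character intact when it is
      // split across two network chunks.
      const text = decoder.decode(value, { stream: true });
      if (text) {
        full += text;
        onChunk(text);
      }
    }
    // Flush any bytes the decoder was still holding back.
    const tail = decoder.decode();
    if (tail) {
      full += tail;
      onChunk(tail);
    }
  } finally {
    reader.releaseLock();
  }
  return full;
}

// Hypothetical usage; "/api/chat" and appendToMessage are placeholders:
//   const res = await fetch("/api/chat", { method: "POST", body: JSON.stringify({ messages }) });
//   if (res.body) await readTextStream(res.body, appendToMessage);
```

Decoding with `stream: true` matters because chunk boundaries are arbitrary byte offsets; decoding each chunk on its own can corrupt any character that straddles two chunks.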

Always have a fallback

An LLM call can be rate-limited or time out, and a given environment may simply have no API key configured. None of that should show the user an error. The route on this site degrades to a rule-based responder whenever the live path is unavailable, and signals the mode in a response header so the UI can label it honestly:

if (!resolveProvider()) {
  return Response.json(
    { mode: "fallback" },
    { headers: { "X-Terminal-Mode": "fallback" } },
  );
}

The feature is therefore never down — it just gets quieter when the model is unreachable. That property is worth more than any single clever prompt.
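One way to cover rate limits and timeouts as well as a missing key is to wrap the live call. The sketch below is an assumption-heavy illustration rather than this site's implementation: `ruleBasedReply`, `replyWithFallback`, the canned answers, the 10-second timeout, and the "live" header value are all hypothetical. Only the X-Terminal-Mode header name and its "fallback" value come from the route above. It returns a plain, non-streamed reply to keep the example short.

```typescript
// Hypothetical rule-based responder: keyword matching, no model call.
function ruleBasedReply(question: string): string {
  const q = question.toLowerCase();
  if (q.includes("price") || q.includes("cost")) {
    return "Pricing depends on scope. The contact page is the best place to start.";
  }
  return "The live assistant is unavailable right now. The site navigation covers most topics.";
}

// Tries the live path and falls back on a missing provider, an error, or a timeout.
// `live` stands in for whatever function calls the provider; null means "not configured".
async function replyWithFallback(
  question: string,
  live: ((q: string) => Promise<string>) | null,
  timeoutMs = 10_000,
): Promise<Response> {
  const respond = (text: string, mode: "live" | "fallback") =>
    new Response(text, {
      headers: {
        "Content-Type": "text/plain; charset=utf-8",
        "Cache-Control": "no-store",
        "X-Terminal-Mode": mode,
      },
    });

  if (!live) return respond(ruleBasedReply(question), "fallback");

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
  });
  try {
    // Whichever settles first wins: the model's answer or the timeout.
    return respond(await Promise.race([live(question), timeout]), "live");
  } catch {
    // Rate limit, network failure, or timeout: answer anyway.
    return respond(ruleBasedReply(question), "fallback");
  } finally {
    clearTimeout(timer);
  }
}
```

With a streamed response the same decision has to be made before the first byte goes out, because the headers are fixed once the Response is constructed and cannot change if the provider fails mid-stream.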

What this buys you

  • First token in a few hundred milliseconds instead of a multi-second spinner.
  • Edge deployment, so latency tracks the user's location, not your data centre's.
  • A graceful fallback that keeps the feature alive without an API key.
  • Zero extra client dependencies — Web streams all the way down.

If you are wiring an LLM into a product flow and want it to feel native rather than bolted on, this is the shape I would start from.
