Case study

AI Story Generator — orchestration, fallback & rate limiting

A serverless LLM feature built to behave well when the AI does not. The story is not "I called an API": it is prompt orchestration on Replicate (Llama 3), a deterministic local generator that takes over when credits run out, per-IP rate limiting to cap cost and abuse, and a multi-turn conversation model so users iterate on a story instead of one-shotting it. Built in vanilla JavaScript with no framework, on purpose, and assembled with LLMs as a build tool.

[Figure: AI Story Generator main form]

1. Problem

AI demos look great in ideal conditions and fall apart under real constraints: the provider runs out of credits, returns a transient 5xx, times out on a cold start, or someone hammers the endpoint and runs up the bill. A "generate a story" button that white-screens the moment Replicate returns HTTP 402 is not a feature—it's a liability.

So the goal was production-minded AI integration: an explicit request contract, controlled generation settings, a path that never leaves the user stranded when the model is unavailable, and guardrails on cost and abuse—all without a framework getting in the way of the actual engineering.

2. Architecture

One serverless endpoint, a resilient client, and a fallback that lives entirely in the browser, so the feature degrades in layers instead of failing all at once.

3. AI orchestration

The endpoint owns the whole generation flow. Length maps to a token budget and a paragraph count, the conversation is flattened into a single prompt, and the model is called with controlled settings—so output stays bounded and the API secret never leaves the server.

const lengthSettings = {
  short:  { maxTokens: 300, paragraphs: 3 },
  medium: { maxTokens: 600, paragraphs: 5 },
  long:   { maxTokens: 900, paragraphs: 7 },
};

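// Pick the budget for the requested length (the "medium" default is an assumption).
const settings = lengthSettings[length] ?? lengthSettings.medium;
const prompt = buildPrompt(messages, tone, settings.paragraphs);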

const output = await replicateClient.run("meta/meta-llama-3-8b-instruct", {
  input: { prompt, max_new_tokens: settings.maxTokens, temperature: 0.75 },
});
const story = Array.isArray(output) ? output.join("") : output;
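The source does not show buildPrompt itself, so the following is only a minimal sketch of what "flattening the conversation into a single prompt" might look like. The turn labels ("Request", "Story so far") and the instruction wording are assumptions made for illustration, not the project's actual prompt.

```javascript
// Hypothetical sketch of buildPrompt; the real prompt text is not shown in the source.
function buildPrompt(messages, tone, paragraphs) {
  // Flatten role/content turns into a plain transcript the model can continue.
  const transcript = messages
    .map((m) => {
      const label = m.role === "assistant" ? "Story so far" : "Request";
      return `${label}: ${m.content.trim()}`;
    })
    .join("\n\n");

  // Tone and paragraph count become explicit instructions at the top of the prompt.
  return [
    `You are a storyteller. Write in a ${tone} tone.`,
    `Write about ${paragraphs} paragraphs.`,
    "",
    transcript,
    "",
    "Story:",
  ].join("\n");
}
```

Keeping the flattening in one pure function means the same history array can be unit-tested without calling the model.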

4. Fallback & resilience

Resilience is layered. The server classifies failures into explicit HTTP contracts; the client retries the transient ones and, when the AI genuinely can't deliver, hands off to a deterministic local generator so the user always gets a story.

Server: classify, don't crash

try {
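  const output = await replicateClient.run(model, { input });
  // Replicate may return the text as an array of chunks; join them into one string.
  const story = Array.isArray(output) ? output.join("") : output;
  return res.status(200).json({ output: story });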
} catch (err) {
  if (isReplicateCreditsError(err)) {
    return res.status(402).json({ error: "replicate_no_credits" });
  }
  console.error("replicate_failed", err);
  return res.status(500).json({ error: "replicate_failed" });
}
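How isReplicateCreditsError decides is not shown in the source. One plausible sketch, assuming the thrown error either carries the HTTP response (err.response.status) or mentions the status in its message, could be:

```javascript
// Hypothetical sketch; the project's actual check is not shown in the source.
function isReplicateCreditsError(err) {
  // Assumption: the error may expose the HTTP status directly or via err.response.
  const status = err?.response?.status ?? err?.status;
  if (status === 402) return true;

  // Fallback heuristic: look for a 402 or a billing hint in the message text.
  const message = String(err?.message ?? "");
  return /\b402\b|payment required|insufficient credit/i.test(message);
}
```

Checking the structured status first and the message second keeps the classifier working even if the error shape differs from what is assumed here.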

Client: retry, then degrade

The deliberate trade-off: the fallback is template-based, not "AI-quality." That's the point—reliability and demo continuity beat an AI-only purity that white-screens the moment a third party has a bad day.
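The client behavior described in this section (retry transient failures, then hand off to a deterministic local generator) can be sketched as follows. This is an illustration under assumptions, not the project's code: generateWithFallback, localStory, the retry count, the backoff delays, and the template text are all hypothetical, and postStory is injected so the sketch does not depend on a real endpoint.

```javascript
// Deterministic local generator: the same input always yields the same story.
// The template wording is a placeholder, not the project's actual template.
function localStory(messages, tone) {
  const seed = messages.find((m) => m.role === "user")?.content ?? "a traveler";
  const turns = messages.length;
  return [
    `This is a ${tone} tale about ${seed}.`,
    `So far the story has taken ${turns} turn${turns === 1 ? "" : "s"}.`,
    "The road bent, the light changed, and the tale went on.",
  ].join("\n\n");
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// postStory(payload) is assumed to resolve to a fetch-like response
// ({ ok, status, json() }) whose body is { output: "<story>" }.
async function generateWithFallback(payload, postStory, { retries = 2, baseDelayMs = 300 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await postStory(payload);
      if (res.ok) {
        const data = await res.json();
        return { story: data.output, source: "ai" };
      }
      // 402 (no credits) and other 4xx will not improve on retry: degrade now.
      if (res.status < 500) break;
    } catch (err) {
      // A user-initiated abort is not a failure to recover from.
      if (err?.name === "AbortError") throw err;
      // Anything else (network error, timeout) is treated as transient here.
    }
    // Exponential backoff between attempts: baseDelayMs, then 2x, then 4x...
    if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
  }
  return { story: localStory(payload.messages, payload.tone), source: "fallback" };
}
```

Returning a source field alongside the story lets the UI label fallback output honestly. A real client would likely treat a 429 separately and surface the rate-limit message instead of degrading.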

5. Rate limiting

An LLM endpoint is a cost endpoint. A small fixed-window limiter keyed by client IP caps requests before any Replicate call is made—protecting the bill and blunting abuse—and reports its state back through standard headers.

function checkKey(key) {
  const now = Date.now();
  const entry = store.get(key);
  if (!entry || now >= entry.resetAt) {
    store.set(key, { count: 1, resetAt: now + windowMs }); // 60s window
    return { allowed: true, remaining: maxRequests - 1 };
  }
  if (entry.count >= maxRequests) {            // default 10 / window
    return { allowed: false, remaining: 0,
             retryAfterSec: Math.ceil((entry.resetAt - now) / 1000) };
  }
  entry.count += 1;
  return { allowed: true, remaining: maxRequests - entry.count };
}

Honest scope: the store is in-memory, so the limit applies per serverless instance and resets whenever an instance is recycled. That is adequate for a portfolio demo and for blunting casual abuse. The clear upgrade path is a shared store (Redis/Upstash) that enforces one global limit across instances.
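How the limiter's result reaches the response is not shown above. A minimal sketch, assuming the platform forwards the client address in x-forwarded-for and that the headers follow the common X-RateLimit-* convention plus Retry-After, might look like this. The helper names clientIp and rateLimitResponse and the "rate_limited" error code are hypothetical.

```javascript
// Hypothetical helper: derive the limiter key from the request.
function clientIp(req) {
  // Assumption: the first x-forwarded-for entry is the original client.
  const forwarded = req.headers["x-forwarded-for"];
  const first = Array.isArray(forwarded) ? forwarded[0] : forwarded;
  return first?.split(",")[0].trim() || req.socket?.remoteAddress || "unknown";
}

// Hypothetical helper: turn a checkKey-style result into headers and,
// when the request is over the limit, a 429 response description.
function rateLimitResponse(result, maxRequests) {
  const headers = {
    "X-RateLimit-Limit": String(maxRequests),
    "X-RateLimit-Remaining": String(result.remaining),
  };
  if (result.allowed) return { blocked: false, headers };

  // Retry-After tells well-behaved clients how long to wait, in seconds.
  headers["Retry-After"] = String(result.retryAfterSec);
  return { blocked: true, status: 429, headers, body: { error: "rate_limited" } };
}
```

In a handler, the limiter would run before any model call: compute the key with clientIp(req), pass it to checkKey, set the returned headers, and return early with the 429 body when blocked is true.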

6. Multi-turn memory

A story isn't one prompt—it's a conversation. The client keeps an in-memory messages array of role/content turns and sends the whole history on every call, so each continuation is conditioned on everything written so far. The same array works for the AI and the fallback path.

// New story: reset the conversation
messages = [{ role: "user", content: seed }];

// Continue: append the user's nudge, resend full history
messages.push({ role: "user", content: prompt || "Continue the story." });
const res = await requestStory({ messages, tone, length }, controller.signal);
messages.push({ role: "assistant", content: story });
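The snippet above does not show how story is read from the response. One way to wrap a full continuation turn, assuming the endpoint replies with a JSON body of the form { output: "..." } as in the server snippet, is sketched below. continueStory and the injected send function are hypothetical names used only for illustration.

```javascript
// Hypothetical wrapper around one continuation turn.
// send(payload) is assumed to resolve to a fetch-like response whose
// JSON body is { output: "<story text>" }.
async function continueStory(messages, nudge, options, send) {
  // Append the user's nudge, or a default instruction when it is empty.
  messages.push({ role: "user", content: nudge || "Continue the story." });

  // Resend the full history so the continuation sees everything so far.
  const res = await send({ messages, ...options });
  const { output: story } = await res.json();

  // Record the reply so the next turn is conditioned on it as well.
  messages.push({ role: "assistant", content: story });
  return story;
}
```

Because the function mutates the shared messages array, a fallback generator can read the same history without any translation step.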

7. AI-native builder

Beyond consuming an LLM at runtime, this project is where I lean on LLMs as a build tool—and the workflow only works because of the constraints around it.

8. Outcomes