Case study

AI Story Generator — orchestration, fallback & rate limiting

A serverless LLM feature built to behave well when the AI does not. The story is not "I called an API". It is prompt orchestration on Replicate (Llama 3), a deterministic local generator that takes over when credits run out or the provider fails, per-IP rate limiting to cap cost and abuse, and a multi-turn conversation model so users iterate on a story instead of one-shotting it. Built in vanilla JS on purpose, with LLMs used as a build tool along the way.


1. Problem

AI demos look great in ideal conditions and fall apart under real constraints: the provider runs out of credits, returns a transient 5xx, times out on a cold start, or someone hammers the endpoint and runs up the bill. A "generate a story" button that white-screens the moment Replicate returns HTTP 402 is not a feature—it's a liability.

So the goal was production-minded AI integration: an explicit request contract, controlled generation settings, a path that never leaves the user stranded when the model is unavailable, and guardrails on cost and abuse—all without a framework getting in the way of the actual engineering.

2. Architecture

One serverless endpoint, a resilient client, and a fallback that lives entirely in the browser—so the feature degrades in layers instead of failing all at once:

┌─────────────┐  POST {messages,   ┌──────────────────────────┐
│   Browser   │  tone, length}     │  /api/generate-stories   │
│ (vanilla JS)│ ─────────────────▶ │  (Vercel Serverless)     │
└──────┬──────┘                    │                          │
       │                           │  1. rate-limit (per IP)  │
       │  fetchWithResilience      │  2. validate request     │
       │  timeout + retry          │  3. buildPrompt()        │
       │                           │  4. replicate.run(...)   │
       │                           └────────────┬─────────────┘
       │                                        │
       │  402 / 5xx / timeout      ┌────────────▼─────────────┐
       │                           │  Replicate (Llama 3 8B)  │
       │                           └──────────────────────────┘
       ▼
┌──────────────────────┐   on failure   localStorage: theme · story history
│ generateLocalStory() │ ◀── deterministic fallback, runs offline
└──────────────────────┘

api/generate-stories.js — serverless handler: rate limit → validate → build prompt → call Replicate → classify errors into HTTP contracts.
public/js/http.js — fetchWithResilience: AbortController timeout, exponential retry on 429/502/503, typed HttpError.
public/js/localGenerator.js — deterministic, offline story generator used when the AI path can't deliver.
public/js/app.js — orchestrates the multi-turn conversation, triggers fallback, and persists history to localStorage.

3. AI orchestration

The endpoint owns the whole generation flow. Length maps to a token budget and a paragraph count, the conversation is flattened into a single prompt, and the model is called with controlled settings—so output stays bounded and the API secret never leaves the server.

const lengthSettings = {
  short:  { maxTokens: 300, paragraphs: 3 },
  medium: { maxTokens: 600, paragraphs: 5 },
  long:   { maxTokens: 900, paragraphs: 7 },
};

const prompt = buildPrompt(messages, tone, settings.paragraphs);

const output = await replicateClient.run("meta/meta-llama-3-8b-instruct", {
  input: { prompt, max_new_tokens: settings.maxTokens, temperature: 0.75 },
});
const story = Array.isArray(output) ? output.join("") : output;

Controlled generation — tone and length are first-class inputs; length caps max_new_tokens so cost and latency stay predictable.
Prompt construction — buildPrompt() flattens the role/content history into one instruction and asks for ONLY the story text, no preamble.
Output normalization — replicate.run() returns the model output as an array of string chunks; the handler joins them into a single string before returning.
Testable in isolation — getLengthSettings and buildPrompt are pure functions, and the handler takes its Replicate client as a dependency, so all three are unit-tested with node --test against a mocked client.
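To make the prompt-construction step concrete, here is a minimal sketch of how getLengthSettings and buildPrompt could look. The function names and the length table come from the text above; the prompt wording, the turn labels, and the "medium" default for unknown lengths are assumptions for illustration, not the repository's actual implementation.

```javascript
// Hypothetical sketch: prompt wording and the "medium" default are assumptions.
const lengthSettings = {
  short:  { maxTokens: 300, paragraphs: 3 },
  medium: { maxTokens: 600, paragraphs: 5 },
  long:   { maxTokens: 900, paragraphs: 7 },
};

// Resolve a length name to its token budget and paragraph count.
// Unknown names fall back to "medium" (assumed default).
function getLengthSettings(length) {
  const known = Object.prototype.hasOwnProperty.call(lengthSettings, length);
  return known ? lengthSettings[length] : lengthSettings.medium;
}

// Flatten role/content turns into one instruction string.
function buildPrompt(messages, tone, paragraphs) {
  const transcript = messages
    .map(({ role, content }) => {
      const label = role === "assistant" ? "Story so far" : "User";
      return `${label}: ${content.trim()}`;
    })
    .join("\n");

  return [
    `You are a storyteller. Write in a ${tone} tone, in about ${paragraphs} paragraphs.`,
    "Conversation so far:",
    transcript,
    "Respond with ONLY the story text, no preamble.",
  ].join("\n\n");
}
```

Because both functions are pure (same inputs, same string out), a test can assert on the exact prompt without touching the network.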

4. Fallback & resilience

Resilience is layered. The server classifies failures into explicit HTTP contracts; the client retries the transient ones and, when the AI genuinely can't deliver, hands off to a deterministic local generator so the user always gets a story.

Server: classify, don't crash

try {
  const output = await replicateClient.run(model, { input });
  // Normalize the chunk array into one string before responding
  const story = Array.isArray(output) ? output.join("") : output;
  return res.status(200).json({ output: story });
} catch (err) {
  if (isReplicateCreditsError(err)) {
    return res.status(402).json({ error: "replicate_no_credits" });
  }
  console.error("replicate_failed", err);
  return res.status(500).json({ error: "replicate_failed" });
}

isReplicateCreditsError() string-matches 402 / "payment required" / "insufficient" / "credit" so the exhausted-credits case becomes a clean, documented 402.
Explicit contracts — 400 invalid request, 402 no credits, 405 wrong method, 429 rate limited, 500 unexpected. The client branches on these, not on guesswork.
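The classification step can be sketched as follows. The matched strings are the ones listed above; checking err.status and err.response.status first is an added assumption about how the error might be shaped, and classifyReplicateError is a hypothetical helper name used only for this illustration.

```javascript
// Hypothetical sketch: which fields a Replicate error carries is an assumption,
// so the status is checked in two places before falling back to string matching.
function isReplicateCreditsError(err) {
  if (!err) return false;
  const status = err.status ?? err.response?.status;
  if (status === 402) return true;

  const text = String(err.message ?? err).toLowerCase();
  return ["402", "payment required", "insufficient", "credit"].some((needle) =>
    text.includes(needle)
  );
}

// Hypothetical helper: map a thrown error to the { status, body } the handler sends.
function classifyReplicateError(err) {
  return isReplicateCreditsError(err)
    ? { status: 402, body: { error: "replicate_no_credits" } }
    : { status: 500, body: { error: "replicate_failed" } };
}
```

Substring matching is deliberately loose: it can misclassify an unrelated message that happens to contain "402" or "credit", which is an acceptable trade-off when the worst case is showing the "Out of credits" notice instead of "AI unavailable".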

Client: retry, then degrade

fetchWithResilience wraps fetch with an AbortController timeout (8s) and exponential-backoff retry on 429/502/503; AbortError becomes a typed TIMEOUT HttpError.
On 402, any non-OK status, an empty body, or a thrown error, app.js calls generateLocalStory(seed, tone, length)—a deterministic generator that runs entirely offline.
Transparency — a one-time, session-scoped notice tells the user they're in fallback mode ("Out of credits" vs "AI unavailable"), so the degraded path is honest, not hidden.
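A minimal sketch of the client wrapper described above, assuming hypothetical option names (timeoutMs, retries, baseDelayMs, fetchImpl) and a hypothetical HttpError shape. Only the 8s timeout, the 429/502/503 retry set, and the TIMEOUT code come from the text.

```javascript
// Hypothetical sketch: option names, defaults, and the HttpError fields are assumptions.
class HttpError extends Error {
  constructor(code, status, message) {
    super(message ?? code);
    this.name = "HttpError";
    this.code = code;     // "TIMEOUT" or "HTTP_ERROR"
    this.status = status; // HTTP status, or null when no response arrived
  }
}

const RETRYABLE = new Set([429, 502, 503]);
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithResilience(url, init = {}, opts = {}) {
  const { timeoutMs = 8000, retries = 2, baseDelayMs = 300, fetchImpl = fetch } = opts;

  for (let attempt = 0; ; attempt += 1) {
    // Fresh controller per attempt; a caller-supplied init.signal is
    // overridden in this sketch.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);

    let res;
    try {
      res = await fetchImpl(url, { ...init, signal: controller.signal });
    } catch (err) {
      if (err.name === "AbortError") {
        throw new HttpError("TIMEOUT", null, "Request timed out");
      }
      throw err;
    } finally {
      clearTimeout(timer);
    }

    if (res.ok) return res;

    if (RETRYABLE.has(res.status) && attempt < retries) {
      await sleep(baseDelayMs * 2 ** attempt); // 300ms, then 600ms
      continue;
    }
    throw new HttpError("HTTP_ERROR", res.status, `HTTP ${res.status}`);
  }
}
```

A 402 is not in the retry set, so it surfaces immediately as an HttpError and the caller can switch to the local generator without waiting through backoff.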

The deliberate trade-off: the fallback is template-based, not "AI-quality." That's the point—reliability and demo continuity beat an AI-only purity that white-screens the moment a third party has a bad day.
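To illustrate what "deterministic" and "template-based" mean here, the sketch below picks templates with a small string hash instead of a random number generator, so the same seed, tone, and length always produce the same story. The generateLocalStory(seed, tone, length) signature comes from the text; the templates, tone names, and hash choice are all hypothetical.

```javascript
// Hypothetical sketch: templates, tone names, and the hash are illustrative only.
const PARAGRAPHS = new Map([["short", 3], ["medium", 5], ["long", 7]]);
const OPENERS = new Map([
  ["whimsical", ["Nobody expected it, but", "On a morning that smelled of cinnamon,"]],
  ["dark", ["Long after the lamps went out,", "In the silence before the storm,"]],
]);
const BEATS = [
  "a choice had to be made.",
  "an old promise came due.",
  "something small changed everything.",
  "the road bent somewhere new.",
];

// FNV-1a-style 32-bit string hash: stable across runs, no randomness.
function hash(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function generateLocalStory(seed, tone, length) {
  const openers = OPENERS.get(tone) ?? OPENERS.get("whimsical");
  const count = PARAGRAPHS.get(length) ?? PARAGRAPHS.get("medium");

  const paragraphs = [];
  for (let i = 0; i < count; i += 1) {
    const h = hash(`${seed}|${tone}|${i}`);
    const opener = openers[h % openers.length];
    const beat = BEATS[(h >>> 8) % BEATS.length];
    // The first paragraph weaves in the user's seed; the rest are pure template.
    paragraphs.push(i === 0 ? `${opener} ${seed.trim()}. And ${beat}` : `${opener} ${beat}`);
  }
  return paragraphs.join("\n\n");
}
```

Determinism also makes the fallback trivially testable: a test can call the function twice and compare strings.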

5. Rate limiting

An LLM endpoint is a cost endpoint. A small fixed-window limiter keyed by client IP caps requests before any Replicate call is made—protecting the bill and blunting abuse—and reports its state back through standard headers.

function checkKey(key) {
  const now = Date.now();
  const entry = store.get(key);
  if (!entry || now >= entry.resetAt) {
    store.set(key, { count: 1, resetAt: now + windowMs }); // 60s window
    return { allowed: true, remaining: maxRequests - 1 };
  }
  if (entry.count >= maxRequests) {            // default 10 / window
    return { allowed: false, remaining: 0,
             retryAfterSec: Math.ceil((entry.resetAt - now) / 1000) };
  }
  entry.count += 1;
  return { allowed: true, remaining: maxRequests - entry.count };
}

Cost before tokens — the limiter runs first in the handler, so a blocked request never reaches (or bills) Replicate.
Client identity — getClientIp() reads x-forwarded-for / x-real-ip (Vercel proxy headers) and falls back to the socket address.
Standard signaling — every response sets X-RateLimit-Remaining; a 429 also sets Retry-After so clients can back off correctly.
Injectable store — the window, the cap, and the backing Map are all options, so tests drive the limiter deterministically without real time or real IPs.
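The points above can be sketched as one small module. The createRateLimiter and getClientIp names, the 60s window, and the default cap of 10 come from the text; the injectable now clock, the applyRateLimit helper, and the "rate_limited" error string are assumptions added for illustration.

```javascript
// Hypothetical sketch: the `now` option, applyRateLimit, and "rate_limited" are assumptions.
function createRateLimiter({
  windowMs = 60_000,
  maxRequests = 10,
  store = new Map(),
  now = Date.now,
} = {}) {
  return function checkKey(key) {
    const t = now();
    const entry = store.get(key);
    if (!entry || t >= entry.resetAt) {
      store.set(key, { count: 1, resetAt: t + windowMs });
      return { allowed: true, remaining: maxRequests - 1 };
    }
    if (entry.count >= maxRequests) {
      return {
        allowed: false,
        remaining: 0,
        retryAfterSec: Math.ceil((entry.resetAt - t) / 1000),
      };
    }
    entry.count += 1;
    return { allowed: true, remaining: maxRequests - entry.count };
  };
}

// x-forwarded-for may hold a comma-separated proxy chain; the first entry is the client.
function getClientIp(req) {
  const forwarded = req.headers["x-forwarded-for"];
  const raw = Array.isArray(forwarded) ? forwarded[0] : forwarded;
  const first = String(raw ?? "").split(",")[0].trim();
  return first || req.headers["x-real-ip"] || req.socket?.remoteAddress || "unknown";
}

// Copy the limiter's verdict onto the response; returns false when blocked.
function applyRateLimit(res, result) {
  res.setHeader("X-RateLimit-Remaining", String(result.remaining));
  if (result.allowed) return true;
  res.setHeader("Retry-After", String(result.retryAfterSec));
  res.status(429).json({ error: "rate_limited" });
  return false;
}

// Driving the limiter with a fake clock instead of real time:
let clock = 0;
const check = createRateLimiter({ windowMs: 1000, maxRequests: 2, now: () => clock });
check("1.2.3.4"); // { allowed: true, remaining: 1 }
check("1.2.3.4"); // { allowed: true, remaining: 0 }
check("1.2.3.4"); // { allowed: false, remaining: 0, retryAfterSec: 1 }
clock = 1000;
check("1.2.3.4"); // new window: { allowed: true, remaining: 1 }
```

Trusting x-forwarded-for assumes the platform's proxy sets it; behind a proxy that passes client-supplied values through, the header would need stricter handling.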

Honest scope: the store is in-memory, so the limit applies per serverless instance and resets whenever an instance is recycled. That is enough for a portfolio demo and for blunting abuse. The clear upgrade path is a shared store (Redis/Upstash) for a global limit across instances.

6. Multi-turn memory

A story isn't one prompt—it's a conversation. The client keeps an in-memory messages array of role/content turns and sends the whole history on every call, so each continuation is conditioned on everything written so far. The same array works for the AI and the fallback path.

// New story: reset the conversation
messages = [{ role: "user", content: seed }];

// Continue: append the user's nudge, resend full history
messages.push({ role: "user", content: prompt || "Continue the story." });
const res = await requestStory({ messages, tone, length }, controller.signal);
const story = res.output; // assumes requestStory resolves to the parsed { output } body
messages.push({ role: "assistant", content: story });

Conversation as state — messages is the single source of truth; the rendered story is just the assistant turns joined together.
Continue / regenerate — "Continue" appends a user turn and resends history; "Regenerate" re-runs the seed. Both flow through one runStoryRequest().
Validated server-side — roles are restricted to user/assistant, content is length-capped, and the array is capped at 20 messages to bound prompt size.
Persistence — completed stories are saved to localStorage and restorable from a collapsible history panel, no backend required.
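The server-side validation rules can be sketched like this. The user/assistant role restriction and the 20-message cap come from the text above; the validateMessages name, the 2000-character content cap, the error strings, and the choice to reject (rather than truncate) oversized content are assumptions.

```javascript
// Hypothetical sketch: MAX_CONTENT_CHARS, the error strings, and rejecting
// (not truncating) long content are assumptions.
const MAX_MESSAGES = 20;
const MAX_CONTENT_CHARS = 2000;
const ALLOWED_ROLES = new Set(["user", "assistant"]);

// Returns { ok: true, messages } or { ok: false, error } for a 400 response.
function validateMessages(input) {
  if (!Array.isArray(input) || input.length === 0) {
    return { ok: false, error: "messages must be a non-empty array" };
  }
  if (input.length > MAX_MESSAGES) {
    return { ok: false, error: `messages is capped at ${MAX_MESSAGES} entries` };
  }

  const messages = [];
  for (const [i, m] of input.entries()) {
    if (!m || !ALLOWED_ROLES.has(m.role)) {
      return { ok: false, error: `messages[${i}].role must be user or assistant` };
    }
    if (typeof m.content !== "string" || m.content.trim() === "") {
      return { ok: false, error: `messages[${i}].content must be a non-empty string` };
    }
    if (m.content.length > MAX_CONTENT_CHARS) {
      return { ok: false, error: `messages[${i}].content exceeds ${MAX_CONTENT_CHARS} characters` };
    }
    // Copy only the known fields so extra client-supplied keys never reach the prompt.
    messages.push({ role: m.role, content: m.content.trim() });
  }
  return { ok: true, messages };
}
```

Rejecting a "system" role matters here because the client controls the whole history: without the allow-list, a caller could inject its own instructions ahead of the server's prompt.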

7. AI-native builder

Beyond consuming an LLM at runtime, this project is where I lean on LLMs as a build tool—and the workflow only works because of the constraints around it.

AI as a tool, not a crutch — I use models to draft and refactor, then gate everything behind syntax checks, node --test, Playwright E2E, and an .env.example validator in CI.
Pure, mockable seams — small exported functions (buildPrompt, getLengthSettings, createRateLimiter, createHandler) make AI-generated changes easy to verify and hard to silently break.
Vanilla on purpose — no framework keeps the focus on AI/API behavior and the platform (fetch, AbortController, modules), which is exactly what makes the codebase legible to both humans and models.
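As an illustration of what a "mockable seam" buys, here is a simplified sketch of a createHandler factory that receives its collaborators as arguments, plus a fake response object for tests. The createHandler name, the model id, and the 402/500 error strings come from the text; the dependency names, the reduced validation, and the "method_not_allowed", "rate_limited", and "invalid_request" strings are assumptions.

```javascript
// Hypothetical sketch: dependency names and the simplified steps are assumptions.
function createHandler({ replicateClient, checkKey, buildPrompt }) {
  return async function handler(req, res) {
    if (req.method !== "POST") {
      return res.status(405).json({ error: "method_not_allowed" });
    }

    // Rate limit first, so a blocked request never reaches the model.
    const limit = checkKey(req.headers["x-forwarded-for"] ?? "unknown");
    if (!limit.allowed) {
      return res.status(429).json({ error: "rate_limited" });
    }

    const { messages, tone } = req.body ?? {};
    if (!Array.isArray(messages) || messages.length === 0) {
      return res.status(400).json({ error: "invalid_request" });
    }

    try {
      const output = await replicateClient.run("meta/meta-llama-3-8b-instruct", {
        input: { prompt: buildPrompt(messages, tone) },
      });
      const story = Array.isArray(output) ? output.join("") : output;
      return res.status(200).json({ output: story });
    } catch (err) {
      const noCredits = /402|payment required|insufficient|credit/i.test(String(err?.message));
      return res
        .status(noCredits ? 402 : 500)
        .json({ error: noCredits ? "replicate_no_credits" : "replicate_failed" });
    }
  };
}

// Minimal fake response that records what the handler sent.
function fakeRes() {
  return {
    statusCode: null,
    body: null,
    status(code) { this.statusCode = code; return this; },
    json(body) { this.body = body; return this; },
  };
}
```

A test then passes { run: async () => ["Once ", "upon"] } as the client and asserts on fakeRes().body, so a model-drafted refactor that breaks the contract fails fast without any network call or credits.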

8. Outcomes

The feature never dead-ends — credits out, transient 5xx, or timeout, the user still gets a story plus an honest notice about why.
Tested at the seams — node --test covers prompt building, length mapping, credit-error classification, and fallback; Playwright drives the generate flow against a mocked API.
CI gates each push — JS syntax checks, unit tests, and .env.example validation via npm run ci (E2E in ci:full).
Live on Vercel serverless at ai-stories-ashy.vercel.app. To see the fallback, spam Generate until the rate limiter answers 429 or the Replicate credits run out.

9. Links

Live app: ai-stories-ashy.vercel.app
Repository: github.com/alejosworkstuff/ai-stories
API handler: api/generate-stories.js
Rate limiter: api/rate-limit.js
HTTP resilience: public/js/http.js
Fallback generator: public/js/localGenerator.js

Replicate · Llama 3 · Serverless · Rate Limiting · Fallback · Vanilla JS