agent-demo-video

Turn a script and a running web app into a finished, narrated, captioned demo video — automated, headless, zero-drift.

Open source (MIT) · Headless · Keyless dry-run · Audio-first sync by construction.

The problem

Demo videos are made by hand: screen-record, write a script, record a voiceover, edit, then re-sync captions — and redo the whole thing every time the UI changes. Narration, screen, and captions drift apart, and nothing about it is reproducible or CI-friendly.

agent-demo-video is a headless pipeline that turns a Markdown DEMO_SCRIPT (shots, browser actions, and narration) plus a running web app into a finished MP4. It is audio-first: narration is synthesised first, and its exact per-character timing becomes the clock that paces the browser recording and the captions, so audio, video, and captions are in sync by construction, with zero drift. Run it keyless with FAKE_TTS to iterate on the script before spending any API quota.
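
To make the audio-first idea concrete, here is a minimal TypeScript sketch of how per-character timings could become both the shot clock and the caption cues. It is not the project's code: the `CharTiming`, `Shot`, and `ShotWindow` shapes and the function names are hypothetical, and it assumes timings are given in seconds relative to the start of each shot's audio.

```typescript
// Hypothetical shapes, for illustration only.
interface CharTiming { char: string; start: number; end: number; } // seconds within the shot's audio
interface Shot { id: string; timings: CharTiming[]; }
interface Cue { text: string; start: number; end: number; }
interface ShotWindow { id: string; start: number; end: number; cues: Cue[]; }

// Group characters into words. A word's cue spans its first to its last character.
function wordCues(timings: CharTiming[]): Cue[] {
  const cues: Cue[] = [];
  let current: Cue | null = null;
  for (const t of timings) {
    if (/\s/.test(t.char)) {
      // Whitespace closes the word in progress, if any.
      if (current) {
        cues.push(current);
        current = null;
      }
      continue;
    }
    if (current === null) {
      current = { text: t.char, start: t.start, end: t.end };
    } else {
      current.text += t.char;
      current.end = t.end;
    }
  }
  if (current) cues.push(current);
  return cues;
}

// Lay shots end to end. Each shot lasts exactly as long as its narration,
// and caption cues are shifted onto the same global timeline.
function buildTimeline(shots: Shot[]): ShotWindow[] {
  const windows: ShotWindow[] = [];
  let offset = 0;
  for (const shot of shots) {
    const duration = shot.timings.reduce((max, t) => Math.max(max, t.end), 0);
    const cues = wordCues(shot.timings).map((c) => ({
      text: c.text,
      start: c.start + offset,
      end: c.end + offset,
    }));
    windows.push({ id: shot.id, start: offset, end: offset + duration, cues });
    offset += duration;
  }
  return windows;
}
```

Because the recording window and the cues are derived from the same numbers, there is nothing to re-align afterwards; that is the sense in which sync holds "by construction".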

Quickstart

Node + pnpm. One keyless dry run proves your script and action sequence before you spend a single TTS credit. The core has no hosted service — everything runs on your machine.

Install (requires Node 18+ and pnpm):

git clone https://github.com/OrionArchitekton/agent-demo-video
cd agent-demo-video
pnpm install
pnpm exec playwright install chromium

Command surface

goto: Navigate the recording browser to a URL (relative to the dashboard base, or absolute).
click: Move the on-screen cursor to a selector and click it, with a visible click ripple.
type: Type text into a field character-by-character, like a real user.
hover: Hover a selector to reveal hover-only states before capturing them.
highlight: Draw an attention box around an element to direct the viewer’s eye.
chapter: Show a lower-third chapter banner to title a section of the walkthrough.
wait: Dwell for N milliseconds — e.g. while an async backend action finishes on screen.
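
One way to picture the seven actions above is as a discriminated union. The sketch below is an assumption-laden illustration, not the project's schema: the field names (`kind`, `selector`, `url`, `text`, `title`, `ms`) and the helpers `resolveGoto` and `describeAction` are hypothetical. The only real API used is the standard `URL` constructor, which resolves a relative target against a base and passes an absolute one through unchanged.

```typescript
// Hypothetical model of the seven actions; field names are illustrative only.
type Action =
  | { kind: "goto"; url: string }
  | { kind: "click"; selector: string }
  | { kind: "type"; selector: string; text: string }
  | { kind: "hover"; selector: string }
  | { kind: "highlight"; selector: string }
  | { kind: "chapter"; title: string }
  | { kind: "wait"; ms: number };

// Resolve a goto target: relative URLs attach to the base, absolute URLs are kept as-is.
function resolveGoto(url: string, base: string): string {
  return new URL(url, base).href;
}

// Turn an action into a one-line description, e.g. for a dry-run log.
// The switch is exhaustive, so adding an action kind without handling it is a type error.
function describeAction(action: Action, base: string): string {
  switch (action.kind) {
    case "goto":
      return `navigate to ${resolveGoto(action.url, base)}`;
    case "click":
      return `click ${action.selector}`;
    case "type":
      return `type ${JSON.stringify(action.text)} into ${action.selector}`;
    case "hover":
      return `hover ${action.selector}`;
    case "highlight":
      return `highlight ${action.selector}`;
    case "chapter":
      return `chapter banner: ${action.title}`;
    case "wait":
      return `wait ${action.ms} ms`;
  }
}
```

For example, `describeAction({ kind: "goto", url: "/settings" }, "http://localhost:3000")` yields `navigate to http://localhost:3000/settings`.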

Why it is different

Audio-first, zero-drift: Narration is synthesised first; its exact per-character timing becomes the clock that paces both the screen recording and the caption file. Audio, video, and captions are in sync by construction — not nudged into alignment afterwards.
Keyless dry run: FAKE_TTS swaps the TTS step for a silent track of estimated duration, so you can iterate on the script and action sequence — with real browser capture and real captions — without spending a cent of API quota.
Headless & reproducible: Playwright drives a headless Chromium with an injected fake cursor and overlays; ffmpeg does the rest. The whole render is one command, so it runs the same on your laptop and in CI — re-render on every UI change instead of re-recording by hand.
Honest by construction: A parity check fails the render if shot count, segment count, or audio/video duration disagree. For a SaaS surface you should not live-drive, a prebaked clip is spliced in at the right point, never faked as live.
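
The parity check described in the last point can be sketched as follows. This is a hypothetical illustration, not the project's implementation: the `RenderReport` shape, the function names, and the 0.05 second duration tolerance are all assumptions chosen for the example.

```typescript
// Hypothetical summary of a finished render; field names are illustrative only.
interface RenderReport {
  shotCount: number;    // shots declared in the script
  segmentCount: number; // video segments actually recorded
  audioSeconds: number; // total narration length
  videoSeconds: number; // total recorded video length
}

// Collect every disagreement instead of stopping at the first one.
// The tolerance is an assumed allowance for container and frame rounding.
function parityErrors(report: RenderReport, toleranceSeconds = 0.05): string[] {
  const errors: string[] = [];
  if (report.shotCount !== report.segmentCount) {
    errors.push(`shot count ${report.shotCount} != segment count ${report.segmentCount}`);
  }
  const drift = Math.abs(report.audioSeconds - report.videoSeconds);
  if (drift > toleranceSeconds) {
    errors.push(`audio/video duration differ by ${drift.toFixed(3)} s`);
  }
  return errors;
}

// Fail the render loudly if anything disagrees.
function assertParity(report: RenderReport): void {
  const errors = parityErrors(report);
  if (errors.length > 0) {
    throw new Error(`parity check failed: ${errors.join("; ")}`);
  }
}
```

Throwing rather than warning means a mismatched render produces a non-zero exit in CI when the error is left uncaught, which fits the "fails the render" behaviour described above.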
