From Script to Finished Video: Inside the agent-media Pipeline

When you run agent-media ugc, a lot happens behind the scenes. Your script goes through 7 distinct stages before becoming a finished video. This article breaks down each stage so you understand what is happening and how to get the best results.

1. Script Splitting

The pipeline starts by analyzing your script and splitting it into scenes. Each scene becomes a separate talking head + B-roll segment. The AI detects natural pauses, topic changes, and sentence boundaries to find the best split points.

Typically, this produces 2-5 scenes for a 10-15 second video. Shorter scripts may stay as a single scene, while longer scripts with multiple ideas get split so each segment has a focused topic. This matters because the B-roll generation in stage 4 works best when each scene has a clear, singular subject.
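The actual splitter is model-driven and not public, but the core idea can be illustrated with a naive sentence-boundary splitter. This sketch only approximates the behavior described above; the real pipeline also weighs pauses and topic changes:

```python
import re

def split_scenes(script: str) -> list[str]:
    """Naive stand-in for the pipeline's scene splitter:
    one scene per sentence, split at sentence-ending punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    return [s for s in sentences if s]

scenes = split_scenes(
    "Meet the new bottle. It keeps drinks cold for 24 hours! Grab yours today."
)
# Three sentences, so three focused scenes
```

Each returned scene then flows independently through the voiceover, talking head, and B-roll stages.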

2. TTS Voiceover

Each scene's text is converted to speech using AI voice synthesis. The voice sounds natural with proper intonation, pacing, and emphasis on the right words. No robotic monotone — the AI understands context and adjusts delivery accordingly.

The audio timing from this stage determines the length of each scene. If a scene's voiceover runs 3.2 seconds, the talking head and B-roll for that scene will be exactly 3.2 seconds. The voiceover is the master clock for the entire video.
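The "master clock" idea is simple to express in code. This sketch (not the pipeline's actual implementation) derives each scene's start and end time purely from its voiceover duration:

```python
def scene_timeline(voiceover_secs: list[float]) -> list[tuple[float, float]]:
    """Return (start, end) times for each scene, driven entirely by
    the per-scene voiceover durations: the audio is the master clock."""
    timeline, t = [], 0.0
    for d in voiceover_secs:
        timeline.append((t, t + d))
        t += d
    return timeline

# Three scenes whose voiceovers run 3.2s, 2.8s, and 4.0s
timeline = scene_timeline([3.2, 2.8, 4.0])
```

Every downstream asset for a scene, talking head and B-roll alike, is rendered to exactly these bounds.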

3. AI Talking Heads

Your face photo is animated to match the voiceover audio. The AI generates lip sync, natural head movements, blinking, and micro-expressions. The result looks like a real person speaking your script.

This is the most computationally intensive stage. The model analyzes the audio waveform to determine mouth shapes frame by frame, then renders the animation with consistent lighting and perspective that matches your original photo. A well-lit, front-facing photo produces the most convincing results.

4. B-Roll Generation

The pipeline generates contextual B-roll footage to intercut with the talking head. The B-roll matches the topic of each scene — product shots, lifestyle footage, workspace visuals, or whatever fits the script content.

This breaks up the "talking to camera" monotony and keeps viewers engaged. Social media audiences scroll past static talking heads, but alternating between a speaker and relevant visuals holds attention significantly longer. The AI reads each scene's text to determine what kind of B-roll will complement the message.
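How the AI maps scene text to B-roll is not documented; as a purely hypothetical illustration of the idea, a keyword heuristic might map scene text to a visual prompt like this (the theme table and prompts here are invented for the example):

```python
def broll_prompt(scene_text: str) -> str:
    """Hypothetical sketch: map scene text to a B-roll prompt.
    The real pipeline uses an AI model, not a keyword lookup."""
    themes = {
        "product": "close-up product shot on a clean surface",
        "work": "overhead workspace footage, hands typing",
        "morning": "lifestyle footage, sunlit kitchen",
    }
    text = scene_text.lower()
    for keyword, prompt in themes.items():
        if keyword in text:
            return prompt
    return "generic lifestyle footage matching the scene topic"
```

The takeaway is the shape of the problem: each scene's text is reduced to a visual subject before footage is generated.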

5. Assembly

The talking head clips and B-roll are assembled together with crossfade transitions, loudnorm audio normalization, and proper timing. The video flows naturally between talking head and B-roll segments, with each transition carefully timed to avoid cutting mid-word or mid-gesture.

Audio normalization ensures consistent volume throughout the video. No jarring volume jumps between scenes, no clipping, no audio that is too quiet to hear. The assembly stage produces a single, cohesive video file ready for the final two stages.
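The loudnorm normalization mentioned above is ffmpeg's EBU R128 loudness filter. As a sketch of the technique (the pipeline runs server-side and its exact settings are not public), an equivalent ffmpeg invocation could be built like this; the target values shown are common defaults, not the pipeline's:

```python
import shlex

def loudnorm_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command applying EBU R128 loudness
    normalization (the loudnorm filter) to a video's audio track."""
    return [
        "ffmpeg", "-i", src,
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # integrated loudness / true peak / range
        "-c:v", "copy",  # leave the video stream untouched
        dst,
    ]

print(shlex.join(loudnorm_cmd("assembled.mp4", "normalized.mp4")))
```

Running audio through a single normalization pass like this is what prevents volume jumps between independently generated scenes.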

6. Animated Subtitles

Subtitles are generated from the voiceover transcript and animated according to your chosen style. Word-level timing ensures perfect sync — each word highlights exactly when it is spoken.

Five subtitle styles are available:

hormozi: Bold, high-contrast captions popularized by Alex Hormozi-style content
bold: Large, clean text with strong visual presence
karaoke: Words highlight progressively as they are spoken
tiktok: Native TikTok-style animated captions
minimal: Clean, understated text that stays out of the way
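Word-level timing is easiest to see with the karaoke style. This sketch (not the actual renderer) formats per-word highlight cues from a word-aligned transcript, where each tuple is a word with its spoken start and end time in seconds:

```python
def karaoke_cues(words: list[tuple[str, float, float]]) -> list[str]:
    """Format word-level highlight cues for karaoke-style subtitles.
    Each word is highlighted exactly over its spoken interval."""
    return [f"{start:.2f}-{end:.2f}: {word}" for word, start, end in words]

cues = karaoke_cues([("Meet", 0.0, 0.25), ("the", 0.25, 0.40), ("bottle", 0.40, 0.90)])
```

Because the cues come from the same transcript as the voiceover, the highlight can never drift out of sync with the audio.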
7. Music + CTA

Background music is added at a low volume to add energy without competing with the voiceover. The music bed gives the video a polished, professional feel that silent UGC content lacks.

An optional end-screen CTA overlay can include your brand or call to action. This is the final touch — your finished video is ready to publish directly to TikTok, Instagram Reels, YouTube Shorts, or any other platform.

What you can control

The pipeline accepts several parameters that let you customize the output. Here is the full CLI command with all available flags:

$ agent-media ugc \
    --script "Your script text goes here" \
    --face face-photo.jpg \
    --subtitle-style hormozi \
    --duration 10 \
    --aspect-ratio 9:16

--script: The text the AI will speak. Write conversationally; this becomes the voiceover.
--face: Path to your face photo. Used to generate the talking head animation.
--subtitle-style: One of hormozi, bold, karaoke, tiktok, minimal.
--duration: Target video length in seconds. Options: 5, 10, or 15.
--aspect-ratio: Output aspect ratio. Use 9:16 for vertical (TikTok, Reels, Shorts).
--ai-script: Let the AI generate the script for you automatically. Adds 5 credits.

Pipeline performance

A typical 10-second video takes approximately 2-4 minutes to generate from start to finish. All 7 stages run server-side — your local machine does not need a GPU or any special hardware. You submit the job and receive a URL to the finished video when it is done.

Shorter videos (5 seconds) finish faster because there are fewer scenes to process. Longer videos (15 seconds) take proportionally longer due to additional talking head and B-roll generation. The bottleneck is usually stage 3 (AI talking heads), which accounts for roughly half the total processing time.

Tips for better results

Write clear, conversational scripts

The TTS engine performs best with natural, spoken language. Write the way you talk, not the way you write an essay. Short sentences. Direct statements. Avoid jargon, acronyms, and complex sentence structures that trip up the voice synthesis.

Use a well-lit face photo

The talking head stage works from a single photo. Even, front-facing lighting with a clear view of your face produces the most realistic animation. Avoid sunglasses, heavy shadows, extreme angles, or photos where your face is partially obscured.

Keep videos under 15 seconds

Short-form content performs best on social media. The 5, 10, and 15 second durations are optimized for TikTok, Instagram Reels, and YouTube Shorts. Shorter videos have higher completion rates, which algorithms reward with more reach.

Use the AI script writer

Not sure what to say? Use the --ai-script flag to generate a script automatically. The AI writes conversational, platform-optimized copy tuned for short-form video. It costs just 5 extra credits and saves significant time on the writing step.

Cost breakdown

UGC pipeline pricing is straightforward: 30 credits per second of output video, plus 5 credits if you use the AI script writer.

Duration      Credits   With AI Script
5 seconds     150       155
10 seconds    300       305
15 seconds    450       455

30 credits/second. AI script writer adds 5 credits flat.
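The pricing rule is simple enough to capture in a few lines. A minimal cost helper based on the figures above:

```python
def ugc_cost(duration_s: int, ai_script: bool = False) -> int:
    """Credits for a UGC video: 30 credits per second of output,
    plus a flat 5 credits when the AI script writer is used."""
    return 30 * duration_s + (5 if ai_script else 0)

# Matches the table: a 10-second video with an AI-written script
cost = ugc_cost(10, ai_script=True)  # 305 credits
```

Handy for budgeting a batch of videos before submitting jobs.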

Try the pipeline

One command. Seven stages. A finished UGC video in under 4 minutes.

$ agent-media ugc --script "..." --face photo.jpg --duration 10