From Script to Finished Video: Inside the agent-media Pipeline
When you run agent-media ugc, a lot happens behind the scenes. Your script goes through 7 distinct stages before becoming a finished video. This article breaks down each stage so you understand what is happening and how to get the best results.
Script Splitting
The pipeline starts by analyzing your script and splitting it into scenes. Each scene becomes a separate talking head + B-roll segment. The AI detects natural pauses, topic changes, and sentence boundaries to find the best split points.
Typically, this produces 2-5 scenes for a 10-15 second video. Shorter scripts may stay as a single scene, while longer scripts with multiple ideas get split so each segment has a focused topic. This matters because the B-roll generation in stage 4 works best when each scene has a clear, singular subject.
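The exact splitting heuristics are internal to the pipeline, but the core idea — break on sentence boundaries, then merge neighbors so the scene count stays bounded — can be sketched in a few lines. Everything here (the function name, the scene cap) is illustrative, not the pipeline's actual code:

```python
import re

def split_script(script: str, max_scenes: int = 5) -> list[str]:
    """Split a script into scenes on sentence boundaries (illustrative only)."""
    # Break on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s.strip()]
    if len(sentences) <= max_scenes:
        return sentences
    # Merge adjacent sentences so we never exceed max_scenes.
    per_scene = -(-len(sentences) // max_scenes)  # ceiling division
    return [" ".join(sentences[i:i + per_scene]) for i in range(0, len(sentences), per_scene)]

script = "Meet the new bottle. It keeps drinks cold for 24 hours. Grab yours today!"
print(split_script(script))  # three scenes, one per sentence
```

A real splitter would also weigh topic changes and pause cues, but the sentence boundary is the backbone of the idea.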
TTS Voiceover
Each scene's text is converted to speech using AI voice synthesis. The voice sounds natural with proper intonation, pacing, and emphasis on the right words. No robotic monotone — the AI understands context and adjusts delivery accordingly.
The audio timing from this stage determines the length of each scene. If a scene's voiceover runs 3.2 seconds, the talking head and B-roll for that scene will be exactly 3.2 seconds. The voiceover is the master clock for the entire video.
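In code terms, the timeline is just a running sum of voiceover durations. A minimal sketch (a hypothetical helper, not part of the CLI):

```python
def scene_timeline(voiceover_durations: list[float]) -> list[tuple[float, float]]:
    """Map each scene's voiceover length to (start, end) on the video timeline.

    The voiceover is the master clock: every talking head and B-roll clip
    inherits its scene's exact duration.
    """
    timeline, cursor = [], 0.0
    for duration in voiceover_durations:
        timeline.append((round(cursor, 2), round(cursor + duration, 2)))
        cursor += duration
    return timeline

# Three scenes whose TTS audio runs 3.2s, 4.1s, and 2.7s:
print(scene_timeline([3.2, 4.1, 2.7]))
# → [(0.0, 3.2), (3.2, 7.3), (7.3, 10.0)]
```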
AI Talking Heads
Your face photo is animated to match the voiceover audio. The AI generates lip sync, natural head movements, blinking, and micro-expressions. The result looks like a real person speaking your script.
This is the most computationally intensive stage. The model analyzes the audio waveform to determine mouth shapes frame by frame, then renders the animation with consistent lighting and perspective that matches your original photo. A well-lit, front-facing photo produces the most convincing results.
B-Roll Generation
The pipeline generates contextual B-roll footage to intercut with the talking head. The B-roll matches the topic of each scene — product shots, lifestyle footage, workspace visuals, or whatever fits the script content.
This breaks up the "talking to camera" monotony and keeps viewers engaged. Social media audiences scroll past static talking heads, but alternating between a speaker and relevant visuals holds attention significantly longer. The AI reads each scene's text to determine what kind of B-roll will complement the message.
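The pipeline's actual cut decisions are not exposed, but one simple intercut strategy — talking head for the front half of each scene, B-roll for the back half — can be sketched like this (illustrative only):

```python
def interleave(scenes: list[dict]) -> list[dict]:
    """Build a simple cut list: talking head first, then B-roll for the
    back half of each scene (one possible intercut strategy)."""
    cuts = []
    for scene in scenes:
        half = scene["duration"] / 2
        cuts.append({"scene": scene["id"], "source": "talking_head", "length": half})
        cuts.append({"scene": scene["id"], "source": "b_roll", "length": half})
    return cuts

cuts = interleave([{"id": 1, "duration": 3.2}, {"id": 2, "duration": 4.0}])
```

Note that the total length of each scene's cuts still equals its voiceover duration — the master clock from stage 2 is never violated.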
Assembly
The talking head clips and B-roll are assembled with crossfade transitions, loudnorm audio normalization, and precise timing. The video flows naturally between talking head and B-roll segments, with each transition carefully timed to avoid cutting mid-word or mid-gesture.
Audio normalization ensures consistent volume throughout the video. No jarring volume jumps between scenes, no clipping, no audio that is too quiet to hear. The assembly stage produces a single, cohesive video file ready for the final two stages.
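This kind of assembly maps naturally onto ffmpeg's `xfade`, `acrossfade`, and `loudnorm` filters. The sketch below builds such a command for two clips — it illustrates the technique, not the pipeline's actual server-side invocation:

```python
def assemble_cmd(clip_a: str, clip_b: str, out: str,
                 a_duration: float, fade: float = 0.5) -> list[str]:
    """Build an ffmpeg command that crossfades two clips and normalizes
    loudness (illustrative; the real pipeline runs server-side)."""
    # The video crossfade starts `fade` seconds before the first clip ends.
    offset = round(a_duration - fade, 3)
    filter_graph = (
        f"[0:v][1:v]xfade=transition=fade:duration={fade}:offset={offset}[v];"
        f"[0:a][1:a]acrossfade=d={fade},loudnorm=I=-16:TP=-1.5:LRA=11[a]"
    )
    return ["ffmpeg", "-i", clip_a, "-i", clip_b,
            "-filter_complex", filter_graph,
            "-map", "[v]", "-map", "[a]", out]

cmd = assemble_cmd("scene1.mp4", "scene2.mp4", "out.mp4", a_duration=3.2)
```

The `loudnorm` targets here (-16 LUFS integrated, -1.5 dBTP true peak) are common defaults for social video, not confirmed pipeline settings.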
Animated Subtitles
Subtitles are generated from the voiceover transcript and animated according to your chosen style. Word-level timing ensures perfect sync — each word highlights exactly when it is spoken.
Five subtitle styles are available:
| Style | Description |
|---|---|
| hormozi | Bold, high-contrast captions popularized by Alex Hormozi-style content |
| bold | Large, clean text with strong visual presence |
| karaoke | Words highlight progressively as they are spoken |
| tiktok | Native TikTok-style animated captions |
| minimal | Clean, understated text that stays out of the way |
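Word-level sync like the karaoke style is commonly implemented with ASS subtitle karaoke tags, where each `{\kN}` tag holds a word's duration in centiseconds. A sketch of that encoding (the docs don't specify the pipeline's renderer, so treat the ASS format here as an assumption):

```python
def ass_karaoke_line(words: list[tuple[str, float, float]]) -> str:
    """Render (word, start, end) timings as an ASS karaoke payload: each
    {\\kN} tag holds a word's duration in centiseconds, so the highlight
    advances exactly when the word is spoken."""
    parts = []
    for word, start, end in words:
        centiseconds = round((end - start) * 100)
        parts.append(f"{{\\k{centiseconds}}}{word}")
    return " ".join(parts)

line = ass_karaoke_line([("Meet", 0.0, 0.30), ("the", 0.30, 0.45), ("bottle", 0.45, 1.05)])
# → "{\k30}Meet {\k15}the {\k60}bottle"
```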
Music + CTA
Background music is mixed in at a low volume to add energy without competing with the voiceover. The music bed gives the video a polished, professional feel that silent UGC content lacks.
An optional end-screen CTA overlay can include your brand or call to action. This is the final touch — your finished video is ready to publish directly to TikTok, Instagram Reels, YouTube Shorts, or any other platform.
What you can control
The pipeline accepts several parameters that let you customize the output. Here is the full CLI command with all available flags:
```shell
$ agent-media ugc \
    --script "Your script text goes here" \
    --face face-photo.jpg \
    --subtitle-style hormozi \
    --duration 10 \
    --aspect-ratio 9:16
```
| Flag | Description |
|---|---|
| --script | The text the AI will speak. Write conversationally — this becomes the voiceover. |
| --face | Path to your face photo. Used to generate the talking head animation. |
| --subtitle-style | One of: hormozi, bold, karaoke, tiktok, minimal. |
| --duration | Target video length in seconds. Options: 5, 10, or 15. |
| --aspect-ratio | Output aspect ratio. Use 9:16 for vertical (TikTok, Reels, Shorts). |
| --ai-script | Let the AI generate the script for you automatically. Adds 5 credits. |

Pipeline performance
A typical 10-second video takes approximately 2-4 minutes to generate from start to finish. All 7 stages run server-side — your local machine does not need a GPU or any special hardware. You submit the job and receive a URL to the finished video when it is done.
Shorter videos (5 seconds) finish faster because there are fewer scenes to process. Longer videos (15 seconds) take proportionally longer due to additional talking head and B-roll generation. The bottleneck is usually stage 3 (AI talking heads), which accounts for roughly half the total processing time.
Tips for better results
Write clear, conversational scripts
The TTS engine performs best with natural, spoken language. Write the way you talk, not the way you write an essay. Short sentences. Direct statements. Avoid jargon, acronyms, and complex sentence structures that trip up the voice synthesis.
Use a well-lit face photo
The talking head stage works from a single photo. Even, front-facing lighting with a clear view of your face produces the most realistic animation. Avoid sunglasses, heavy shadows, extreme angles, or photos where your face is partially obscured.
Keep videos under 15 seconds
Short-form content performs best on social media. The 5, 10, and 15 second durations are optimized for TikTok, Instagram Reels, and YouTube Shorts. Shorter videos have higher completion rates, which algorithms reward with more reach.
Use the AI script writer
Not sure what to say? Use the --ai-script flag to generate a script automatically. The AI writes conversational, platform-optimized copy tuned for short-form video. It costs just 5 extra credits and saves significant time on the writing step.
Cost breakdown
UGC pipeline pricing is straightforward: 30 credits per second of output video, plus 5 credits if you use the AI script writer.
| Duration | Credits | With AI Script |
|---|---|---|
| 5 seconds | 150 | 155 |
| 10 seconds | 300 | 305 |
| 15 seconds | 450 | 455 |
30 credits/second. AI script writer adds 5 credits flat.
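The pricing formula is simple enough to express directly. A small helper (hypothetical, just mirroring the table above):

```python
def ugc_cost(duration_seconds: int, ai_script: bool = False) -> int:
    """Credits for a UGC render: 30 credits per second of output,
    plus a flat 5 credits if the AI script writer is used."""
    if duration_seconds not in (5, 10, 15):
        raise ValueError("duration must be 5, 10, or 15 seconds")
    return duration_seconds * 30 + (5 if ai_script else 0)

print(ugc_cost(10))                  # → 300
print(ugc_cost(15, ai_script=True))  # → 455
```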
Try the pipeline
One command. Seven stages. A finished UGC video in under 4 minutes.