Ramvoy Docs

How Ramvoy works.

Ramvoy creates long AI videos and other content by combining many specialized models. The planner decides what needs to be made, adapters prepare each model call, models generate assets, and the internal assembly system renders the final video.

Total systems

Across image, video, speech, music, editing, and final assembly.

AI models

Used to generate visuals, clips, narration, music, and edits.

Internal systems

Final assembly is handled internally with FFmpeg.

Workflow

From prompt to final output.

This is the full production flow. Each part has a specific job, and together they turn one prompt into a complete AI video.

User input

The user chooses what they want to make, writes a prompt, selects a target length, and optionally uploads images. This is the creative brief for the whole run.

Agent type defines the kind of project.
Prompt explains the story, topic, style, or goal.
Length tells Ramvoy how big the final output should be.
Images can become references, characters, products, or source frames.

Planner creates a structured plan

The planner turns the user request into a step-by-step production plan. Instead of calling one model blindly, it decides what assets are needed and how to create them.

Creates the strategy.
Breaks the video into steps.
Decides if images, speech, music, video clips, or captions are needed.
Defines final assembly rules.
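A structured plan of this kind could be represented roughly as follows. This is a minimal sketch, not Ramvoy's actual schema: the class names, field names, and the example prompts are all hypothetical, and only the model identifiers come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    """One unit of work in the plan (all field names are hypothetical)."""
    step_id: str
    kind: str   # "image", "speech", "video", "music", or "assembly"
    model: str  # model identifier, e.g. "flux_2_pro"
    prompt: str = ""
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Plan:
    """Strategy, ordered steps, and assembly rules for one run."""
    strategy: str
    steps: list[PlanStep]
    assembly_rules: dict[str, str] = field(default_factory=dict)

    def needed_kinds(self) -> set[str]:
        """Asset types the plan requires, excluding the assembly step."""
        return {s.kind for s in self.steps if s.kind != "assembly"}


example_plan = Plan(
    strategy="Narrated explainer with one animated scene",
    steps=[
        PlanStep("img1", "image", "flux_2_pro", "A lighthouse at dawn"),
        PlanStep("vo1", "speech", "tts_1_5_max", "Welcome to the coast."),
        PlanStep("clip1", "video", "kling_video_3_pro", "Slow push-in", ["img1"]),
        PlanStep("final", "assembly", "internal_ffmpeg", depends_on=["clip1", "vo1"]),
    ],
    assembly_rules={"resolution": "1080p", "aspect_ratio": "16:9"},
)
```

Recording dependencies on each step lets later stages know which earlier outputs a step needs before it can run.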

The plan selects models

Each plan step points to the model best suited for that job. One step may generate an image, another may create narration, another may generate video, and the final step assembles everything.

Flux creates or edits still images.
TTS creates narration.
Kling, PixVerse, Runway, Seedance, Grok, Ray, or Wan create video clips.
Music models create songs, music beds, or soundtracks.
Internal FFmpeg creates the final render.

Runtime adapters prepare inputs

Every model expects different input names, formats, durations, and settings. Runtime adapters translate Ramvoy’s plan into the exact input shape each model requires.

Normalizes prompts.
Adds image or audio URLs.
Sets duration, resolution, and aspect ratio.
Prevents invalid model calls.
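The adapter responsibilities listed above might look like this in code. It is a sketch under stated assumptions: the ModelSpec class, the adapt function, and the payload field names ("prompt", "duration", "image_url") are hypothetical, and the choice to clamp out-of-range durations rather than reject them is an assumption. The numeric limits are the ones this page lists for Kling Video 3 Pro.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of what one model accepts."""
    prompt_field: str
    min_seconds: int
    max_seconds: int
    resolutions: tuple[str, ...]
    aspect_ratios: tuple[str, ...]


# Limits from the Kling Video 3 Pro entry on this page;
# the input field name "prompt" is an assumption.
KLING_PRO = ModelSpec("prompt", 3, 15, ("1080p",), ("16:9", "9:16", "1:1"))


def adapt(spec, prompt, seconds, resolution, aspect_ratio, image_url=None):
    """Build a model input dict, or raise ValueError for an invalid call."""
    if resolution not in spec.resolutions:
        raise ValueError(f"unsupported resolution: {resolution}")
    if aspect_ratio not in spec.aspect_ratios:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    # Clamp the requested duration into the model's allowed range.
    duration = max(spec.min_seconds, min(spec.max_seconds, seconds))
    payload = {
        spec.prompt_field: " ".join(prompt.split()),  # normalize whitespace
        "duration": duration,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
    }
    if image_url is not None:
        payload["image_url"] = image_url
    return payload
```

For example, adapt(KLING_PRO, "  slow   push-in ", 20, "1080p", "16:9") yields a 15-second request with a cleaned prompt, while asking for "720p" raises ValueError before any model is called.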

Assets are generated and stored

Ramvoy generates outputs step by step. Images, clips, narration, and music are saved so later steps can reuse them.

Generated images can become video inputs.
Generated narration can drive presenter videos.
Generated music can be mixed into the final video.
Generated clips become the building blocks of the final output.
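One way to picture asset reuse is a small store keyed by step. The sketch below is illustrative only: AssetStore, its methods, and the example URLs are hypothetical, and a real system would persist files rather than keep a dictionary in memory.

```python
class AssetStore:
    """In-memory stand-in for wherever generated assets are saved."""

    def __init__(self):
        self._assets = {}

    def save(self, step_id, kind, url):
        """Record the output of a finished step."""
        self._assets[step_id] = {"kind": kind, "url": url}

    def inputs_for(self, depends_on):
        """Collect URLs of earlier outputs, grouped by asset kind."""
        grouped = {}
        for step_id in depends_on:
            asset = self._assets[step_id]  # KeyError if the step has not run
            grouped.setdefault(asset["kind"], []).append(asset["url"])
        return grouped


store = AssetStore()
store.save("img1", "image", "https://example.com/img1.png")
store.save("vo1", "audio", "https://example.com/vo1.mp3")

# A presenter-video step that needs both the image and the narration:
presenter_inputs = store.inputs_for(["img1", "vo1"])
```

Here presenter_inputs holds one image URL and one audio URL, which is the input pair a talking-head step would need.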

Final assembly renders the deliverable

The internal FFmpeg assembly step combines all generated assets into the final output. This is what turns separate model outputs into one complete video.

Concatenates clips.
Loops still images when needed.
Mixes narration, music, and native video audio.
Exports the final video file.
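The assembly operations above map onto standard FFmpeg features. The sketch below only builds command lines and does not run them; the function names and file names are hypothetical, and Ramvoy's real assembly commands are not documented here. It uses the image loop option, the concat demuxer, and the amix filter. For brevity it mixes only external tracks (narration and music); including native video audio would mean adding the video's own audio stream to the filter graph.

```python
def concat_list(clips):
    """Text of a concat-demuxer list file, one clip per line.

    Paths containing single quotes would need escaping; omitted here.
    """
    return "".join(f"file '{path}'\n" for path in clips)


def loop_still_cmd(image, seconds, out):
    """Turn one still image into a video clip of the given length."""
    return ["ffmpeg", "-y", "-loop", "1", "-t", str(seconds), "-i", image,
            "-c:v", "libx264", "-pix_fmt", "yuv420p", out]


def concat_cmd(list_file, out):
    """Join the clips named in list_file without re-encoding.

    Stream copy assumes all clips share codec, resolution, and frame rate.
    """
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file,
            "-c", "copy", out]


def mix_cmd(video, audio_tracks, out):
    """Mix the given audio tracks and attach the result to the video stream."""
    cmd = ["ffmpeg", "-y", "-i", video]
    for track in audio_tracks:
        cmd += ["-i", track]
    # Input 0 is the video, so the audio files are inputs 1..n.
    labels = "".join(f"[{i}:a]" for i in range(1, len(audio_tracks) + 1))
    graph = f"{labels}amix=inputs={len(audio_tracks)}:duration=longest[a]"
    return cmd + ["-filter_complex", graph, "-map", "0:v", "-map", "[a]",
                  "-c:v", "copy", "-c:a", "aac", out]
```

Each function returns an argument list suitable for subprocess.run, which avoids shell quoting problems with file names.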

Model stack

What each model does.

Ramvoy does not rely on one model. It routes each production task to the best model for that job.
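As an illustration of routing, the sketch below picks a video model by required resolution and audio support, then takes the lowest per-second price. The prices, resolutions, and identifiers are taken from the model entries on this page. The selection rule itself is an assumption: cost is only one possible criterion, and how Ramvoy's planner actually decides which model is "best" is not described here. The VideoModel class and cheapest function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoModel:
    """A few facts about one video model, as listed on this page."""
    model_id: str
    usd_per_second: float
    resolutions: frozenset
    native_audio: bool


# A subset of the video models documented on this page.
CATALOG = [
    VideoModel("runway_gen_4_5", 0.24, frozenset({"720p", "1080p"}), False),
    VideoModel("seedance_1_pro", 0.30, frozenset({"1080p"}), False),
    VideoModel("ray_flash_2_720p", 0.12, frozenset({"720p"}), False),
    VideoModel("kling_video_3_standard_audio", 0.50, frozenset({"720p"}), True),
    VideoModel("kling_video_3_pro_audio", 0.60, frozenset({"1080p"}), True),
]


def cheapest(resolution, need_audio):
    """Return the id of the lowest-priced model meeting the requirements."""
    candidates = [
        m for m in CATALOG
        if resolution in m.resolutions and (m.native_audio or not need_audio)
    ]
    if not candidates:
        raise LookupError(f"no model supports {resolution}, audio={need_audio}")
    return min(candidates, key=lambda m: m.usd_per_second).model_id
```

With this catalog, a 1080p clip without audio routes to runway_gen_4_5, and a 1080p clip that needs native audio routes to kling_video_3_pro_audio.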

Image generation

These models create or edit still images used as source frames, character references, product visuals, thumbnails, scene anchors, or concept art.

Flux 2 Pro

Text / image → image

flux_2_pro

Creates high-quality images from prompts and optional reference images. Best for source visuals before video generation.

What it does in Ramvoy

Ramvoy can use this to create the first visual version of a scene, character, product, poster, thumbnail, or cinematic still.

Inputs

Text
Image references

Outputs

Image

Limits

Supports up to 8 reference images.

Cost

$0.60 per run

Scene concept images
Character visuals
Product shots
Brand visuals
Video source frames

Flux 2 Pro Edit

Text + image → edited image

flux_2_pro_edit

Edits existing images using natural language instructions while preserving quality and subject consistency.

What it does in Ramvoy

Ramvoy can use this when the user uploads an image and wants it improved, restyled, cleaned up, modified, or prepared for video generation.

Inputs

Text instruction
Image

Outputs

Edited image

Limits

Can match the input image shape or use custom aspect ratios.

Cost

$0.10 per run

Edit uploaded images
Polish source frames
Change style or composition
Product image edits
Character consistency

Video generation

These models create short video clips from text, images, or video inputs. Ramvoy combines many clips together to create longer AI videos.
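Because each model caps clip length, a long target has to be divided into clips that each fit the model's range. The following sketch shows one way to do that arithmetic. The function name and the even-split policy are assumptions; the 3 to 15 second range in the usage example is the limit this page lists for the Kling models.

```python
import math


def split_duration(total_seconds, min_len, max_len):
    """Split a target length into the fewest clips that fit [min_len, max_len].

    Returns whole-second clip lengths that sum to total_seconds, spread as
    evenly as possible. Raises ValueError if no such split exists.
    """
    if total_seconds < min_len:
        raise ValueError("target is shorter than the minimum clip length")
    count = math.ceil(total_seconds / max_len)
    base, extra = divmod(total_seconds, count)
    if base < min_len:
        raise ValueError("cannot split evenly within the allowed range")
    # The first `extra` clips get one additional second each.
    return [base + 1] * extra + [base] * (count - extra)


clips_60 = split_duration(60, 3, 15)    # four clips of 15 seconds
clips_100 = split_duration(100, 3, 15)  # seven clips of 14 or 15 seconds
```

Models with a fixed set of durations, such as one that only allows 5 or 10 seconds, would need a different policy than this range-based split.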

Grok Imagine Video

Text / image / video → video + audio

grok_imagine_video

Creates short videos with audio from text or images, and can also edit existing videos.

What it does in Ramvoy

Ramvoy can use this for fast audio-aware video clips, animated image scenes, or short edits to existing video inputs.

Inputs

Text
Image
Video

Outputs

Video
Audio

Limits

Generation: 1–15 seconds. Editing: input video limit is 8.7 seconds, and output matches input duration, ratio, and resolution.

Cost

$0.10 per second

Short cinematic clips
Image-to-video animation
Video editing
Audio-aware scenes

Runway Gen-4.5

Text / image → video

runway_gen_4_5

Premium cinematic video generation with strong prompt adherence, realism, and physical accuracy.

What it does in Ramvoy

Ramvoy can use this for hero shots, premium cinematic clips, realistic movement, and high-quality visual sequences.

Inputs

Text
Image

Outputs

Video

Limits

Allowed durations: 5 or 10 seconds. Supports 720p and 1080p.

Cost

$0.24 per second

Premium hero clips
Cinematic scenes
Realistic motion
High-end visual sequences

Kling Video 3 Standard

Text / image → video

kling_video_3_standard

720p cinematic video generation for short story-driven clips and social media scenes.

What it does in Ramvoy

Ramvoy can use this when it needs affordable cinematic clips with multi-shot style planning.

Inputs

Text
Image

Outputs

Video

Limits

3–15 seconds. 720p. Aspect ratios: 16:9, 9:16, 1:1.

Cost

$0.32 per second

Story clips
Social videos
Product demos
Marketing scenes

Kling Video 3 Standard Audio

Text / image → video + audio

kling_video_3_standard_audio

720p video generation with stronger native audio support, including dialogue, sound effects, and ambience.

What it does in Ramvoy

Ramvoy can use this when a scene needs native sound, dialogue-like audio, ambience, or stronger audio-video sync.

Inputs

Text
Image

Outputs

Video
Audio

Limits

3–15 seconds. 720p. Aspect ratios: 16:9, 9:16, 1:1.

Cost

$0.50 per second

Dialogue scenes
Native audio clips
Lip-sync style scenes
SFX scenes

Kling Video 3 Pro

Text / image → video

kling_video_3_pro

1080p cinematic video generation with improved fidelity and consistency.

What it does in Ramvoy

Ramvoy can use this for higher-quality visual scenes when audio is handled separately by TTS or music models.

Inputs

Text
Image

Outputs

Video

Limits

3–15 seconds. 1080p. Aspect ratios: 16:9, 9:16, 1:1.

Cost

$0.50 per second

High-fidelity scenes
Cinematic clips
Narrative sequences
Polished visuals

Kling Video 3 Pro Audio

Text / image → video + audio

kling_video_3_pro_audio

1080p cinematic video generation with native audio, sound effects, ambience, and dialogue-style support.

What it does in Ramvoy

Ramvoy can use this for polished short-form clips where both visual quality and built-in audio matter.

Inputs

Text
Image

Outputs

Video
Audio

Limits

3–15 seconds. 1080p. Aspect ratios: 16:9, 9:16, 1:1.

Cost

$0.60 per second

Premium audio scenes
Dialogue clips
SFX-heavy scenes
High-quality social videos

PixVerse v5.6

Text / image → video + audio

pixverse_v5_6

Cost-effective video generation with native audio-style output, camera movement, dialogue-style scenes, and multi-shot support.

What it does in Ramvoy

Ramvoy can use PixVerse when it needs a balance of cost, speed, audio-aware generation, and cinematic camera motion.

Inputs

Text
Image

Outputs

Video
Audio

Limits

1–15 seconds. Available variants include 360p, 540p, 720p, and 1080p.

Cost

$0.14–$0.30 per second depending on quality

Affordable short clips
Stylized scenes
Audio-aware scenes
Camera movement
Social content

Wan 2.2 T2V Fast

Text → video

wan_2_2_t2v_fast

Fast, low-cost text-to-video model for efficient clip generation.

What it does in Ramvoy

Ramvoy can use this for cheaper utility clips, background shots, quick tests, or lower-cost video generation.

Inputs

Text

Outputs

Video

Limits

Available in 480p and 720p.

Cost

$0.10–$0.20 per run

Low-cost clips
Fast draft videos
Simple text-to-video scenes

Seedance 1.0 Pro

Text / image → video

seedance_1_pro

1080p text-to-video and image-to-video model with strong narrative, cinematic, and multi-shot capabilities.

What it does in Ramvoy

Ramvoy can use this for complex narrative scenes, cinematic action, documentary-style shots, and consistent visual storytelling.

Inputs

Text
Image

Outputs

Video

Limits

1080p. Supports multiple aspect ratios.

Cost

$0.30 per second

Narrative scenes
Documentary visuals
Complex action
Higher-resolution outputs

Ray Flash 2

Text → video

ray_flash_2_720p

720p video generation focused on realistic motion, physics, environments, and cinematic composition.

What it does in Ramvoy

Ramvoy can use this for environmental motion, realistic physical scenes, macro shots, natural effects, and dynamic movement.

Inputs

Text

Outputs

Video

Limits

5-second clips at 720p.

Cost

$0.12 per second

Motion-heavy shots
Environmental scenes
Physics realism
Cinematic B-roll

Fabric 1.0

Image + audio → talking-head video

fabric_1_0

Creates a speaking presenter or avatar video from a source image and an audio track.

What it does in Ramvoy

Ramvoy can use this when it needs a face, avatar, influencer, or presenter to speak using generated narration.

Inputs

Image
Audio

Outputs

Video

Limits

1–60 seconds. Available in 480p and 720p.

Cost

$0.16–$0.30 per second depending on resolution

Presenter videos
Avatar videos
Talking-head clips
Speech-driven face animation

Speech and music

These models generate voiceovers, narration, songs, instrumentals, soundtracks, and longer-form audio that can be mixed into the final video.

TTS 1.5 Max

Text → speech audio

tts_1_5_max

Creates high-quality narration, dialogue, expressive speech, multilingual voiceovers, and SSML-based delivery.

What it does in Ramvoy

Ramvoy can use this to generate narration for documentaries, explainer videos, trailers, movies, and presenter workflows.

Inputs

Text

Outputs

Audio

Limits

Supports emotion markup, SSML, expressive delivery, and optional voice cloning.

Cost

$0.06 per 1,000 characters

Narration
Voiceover
Dialogue
Explainers
Documentaries

Music 1.5

Text / audio → music audio

music_1_5

Creates accompaniment and vocal-based music from lyrics and style prompts, with optional reference music.

What it does in Ramvoy

Ramvoy can use this for music videos, songs with lyrics, branded tracks, and music-led projects.

Inputs

Text
Audio reference

Outputs

Audio

Limits

1–240 seconds. Supports lyrics, vocals, and reference-style music.

Cost

$0.06 per run

Songs with lyrics
Music videos
Prompt-driven music
Vocal tracks

Lyria 2

Text → music audio

lyria_2

Creates high-fidelity 48kHz stereo music across genres like classical, jazz, pop, electronic, and orchestral.

What it does in Ramvoy

Ramvoy can use this for clean background music, cinematic scoring, transitions, and soundtrack-style audio.

Inputs

Text

Outputs

Audio

Limits

1–30 seconds.

Cost

$0.004 per second

Background music
Soundtracks
Short music beds
Cinematic audio

ACE-Step

Text / audio → music audio

ace_step

Flexible music model for full songs, duration control, remixing, style transfer, lyric editing, and longer creative audio.

What it does in Ramvoy

Ramvoy can use this for longer songs, remix-style workflows, lyric-aware tracks, and creative music generation.

Inputs

Text
Audio reference

Outputs

Audio

Limits

1–300 seconds.

Cost

$0.06 per run

Longer songs
Remixes
Lyric-aware music
Style transfer
Creative music workflows

Assembly

After images, clips, narration, and music are generated, Ramvoy uses an internal rendering system to combine everything into the final output.

Internal FFmpeg Assembly

Video / audio / image / captions → final video

internal_ffmpeg

Internal rendering step that stitches clips, loops images, mixes narration and music, adds timing, and exports the final video.

What it does in Ramvoy

Ramvoy uses this as the final production stage. It turns many separate AI-generated assets into one complete video file.

Inputs

Video
Audio
Captions
Image

Outputs

Final video

Limits

Supports 480p, 720p, and 1080p. Supports long-form assembly.

Cost

$0 internal cost

Final render
Clip concatenation
Audio mixing
Image-to-video assembly
Long-form video output

Architecture

Planner first, adapters second, render last.

This is the core architecture. The planner decides what should happen, adapters make every model call valid, execution generates the assets, and assembly creates the final video.

Planner

Structured execution plan

Converts the prompt into a production plan with scenes, assets, model choices, timing, and assembly rules.

Adapters

Model-specific inputs

Convert each planner step into the exact input format expected by Grok, Runway, Kling, Flux, TTS, music models, or FFmpeg.

Execution

Assets are generated

Runs each model step and stores generated images, video clips, narration, music, and other outputs.

Assembly

FFmpeg final render

Stitches clips, loops stills, mixes audio, adds timing, and exports the final video deliverable.
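The four stages can be sketched as a single loop. Everything here is a stand-in: run_pipeline, the dictionary-based step format, and the lambda adapters and models are hypothetical placeholders that return strings instead of calling any real API. The sketch shows only the order of operations, where each step is adapted, executed, stored, and finally handed to assembly.

```python
def run_pipeline(plan_steps, adapters, models, assemble):
    """Run plan steps in order, then pass all stored outputs to assembly."""
    outputs = {}
    for step in plan_steps:
        adapter = adapters[step["model"]]
        model = models[step["model"]]
        payload = adapter(step, outputs)      # adapter: plan step -> valid inputs
        outputs[step["id"]] = model(payload)  # execution: inputs -> stored asset
    return assemble(outputs)


# Stand-in adapters and models so the sketch runs without any real service.
adapters = {
    "flux_2_pro": lambda step, outputs: {"prompt": step["prompt"]},
    "kling_video_3_pro": lambda step, outputs: {
        "prompt": step["prompt"],
        "image": outputs[step["uses"]],  # reuse an earlier step's output
    },
}
models = {
    "flux_2_pro": lambda payload: "image:" + payload["prompt"],
    "kling_video_3_pro": lambda payload: "video-from(" + payload["image"] + ")",
}
plan_steps = [
    {"id": "img1", "model": "flux_2_pro", "prompt": "lighthouse"},
    {"id": "clip1", "model": "kling_video_3_pro", "prompt": "push-in", "uses": "img1"},
]

result = run_pipeline(plan_steps, adapters, models, lambda outputs: outputs["clip1"])
```

Keeping adapters separate from models means a new model can be added by registering one adapter and one callable, without changing the loop.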

Example flow

A typical run

A single Ramvoy run can call multiple models. This example shows how a long AI video can be built from separate specialized steps.

Step 1

Generate or edit visuals

Flux 2 Pro creates source images, characters, or scene anchors.

Step 2

Generate narration

TTS 1.5 Max creates voiceover, dialogue, or presenter speech.

Step 3

Generate video clips

Runway, Kling, PixVerse, Seedance, Grok, Ray, Wan, or Fabric creates motion clips.

Step 4

Generate music

Music 1.5, Lyria 2, or ACE-Step creates soundtrack, songs, or background music.

Step 5

Assemble the final video

Internal FFmpeg combines visuals, clips, narration, music, and timing into one final output.
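Since the models are priced in different units (per run, per second, per 1,000 characters), the cost of a run like this one is a sum over mixed units. The sketch below works one hypothetical example using prices listed on this page. The usage quantities are invented for illustration, the function names are hypothetical, and billing text-to-speech proportionally rather than in whole 1,000-character blocks is an assumption.

```python
from decimal import Decimal

# (billing unit, price in USD) per model, as listed on this page.
PRICES = {
    "flux_2_pro": ("run", Decimal("0.60")),
    "tts_1_5_max": ("1k_chars", Decimal("0.06")),
    "kling_video_3_pro": ("second", Decimal("0.50")),
    "lyria_2": ("second", Decimal("0.004")),
    "internal_ffmpeg": ("run", Decimal("0")),
}


def step_cost(model, quantity):
    """Cost of one usage line; quantity is runs, seconds, or characters."""
    unit, price = PRICES[model]
    if unit == "1k_chars":
        # Assumption: billed proportionally, not rounded up per 1,000 chars.
        return price * Decimal(quantity) / Decimal(1000)
    return price * Decimal(quantity)


def run_cost(usage):
    """Total cost of a list of (model, quantity) pairs."""
    return sum((step_cost(m, q) for m, q in usage), Decimal("0"))


# Hypothetical run: 2 images, 1,500 characters of narration,
# 40 seconds of video, 30 seconds of music, and one final render.
usage = [
    ("flux_2_pro", 2),
    ("tts_1_5_max", 1500),
    ("kling_video_3_pro", 40),
    ("lyria_2", 30),
    ("internal_ffmpeg", 1),
]
total = run_cost(usage)
```

In this example the video clips account for $20.00 of the $21.41 total, so clip length and video model choice dominate the estimate.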