How Ramvoy works.
Ramvoy creates long AI videos and other content by combining many specialized models. The planner decides what needs to be made, adapters prepare each model call, the models generate assets, and the internal assembly system renders the final video.
Total systems
23
Across image, video, speech, music, editing, and final assembly.
AI models
22
Used to generate visuals, clips, narration, music, and edits.
Internal systems
1
Final assembly is handled internally with FFmpeg.
Workflow
From prompt to final output.
This is the full production flow. Each part has a specific job, and together they turn one prompt into a complete AI video.
User input
The user chooses what they want to make, writes a prompt, selects a target length, and optionally uploads images. This is the creative brief for the whole run.
Planner creates a structured plan
The planner turns the user request into a step-by-step production plan. Instead of calling one model blindly, it decides what assets are needed and how to create them.
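As a rough illustration of what a structured plan could look like, the sketch below expands one request into ordered steps. The `PlanStep` fields, the `build_plan` function, and the fixed 10-second scene length are all assumptions made for this example, not Ramvoy's actual plan format.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    """One unit of work in a production plan (hypothetical shape)."""
    step_id: str
    kind: str  # "narration", "image", "video", "music", or "assembly"
    prompt: str = ""
    depends_on: list[str] = field(default_factory=list)


def build_plan(user_prompt: str, target_seconds: int, clip_seconds: int = 10) -> list[PlanStep]:
    """Expand a request into narration, per-scene image and video steps, music, and assembly."""
    if target_seconds <= 0 or clip_seconds <= 0:
        raise ValueError("durations must be positive")
    scene_count = -(-target_seconds // clip_seconds)  # ceiling division
    steps = [PlanStep("narration", "narration", user_prompt)]
    for i in range(scene_count):
        image_id, video_id = f"image_{i}", f"video_{i}"
        scene_prompt = f"{user_prompt} (scene {i + 1})"
        steps.append(PlanStep(image_id, "image", scene_prompt))
        # Each clip depends on the still image generated for the same scene.
        steps.append(PlanStep(video_id, "video", scene_prompt, [image_id]))
    steps.append(PlanStep("music", "music", user_prompt))
    clip_ids = [s.step_id for s in steps if s.kind == "video"]
    # Assembly runs last and needs every clip plus both audio tracks.
    steps.append(PlanStep("assembly", "assembly", depends_on=["narration", "music", *clip_ids]))
    return steps
```

Recording dependencies explicitly is one way to let later steps know which earlier outputs they need.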
The plan selects models
Each plan step points to the model best suited for that job. One step may generate an image, another may create narration, another may generate video, and the final step assembles everything.
Runtime adapters prepare inputs
Every model expects different input names, formats, durations, and settings. Runtime adapters translate Ramvoy’s plan into the exact input shape each model requires.
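The adapter idea can be sketched as a lookup table of small translation functions. Everything named here is hypothetical: the model keys `video_a` and `tts`, the step keys, and the payload field names are placeholders, since the real input schemas are not described on this page.

```python
from typing import Any, Callable

# A generic plan step as a planner might emit it (hypothetical keys).
Step = dict[str, Any]


def video_a_adapter(step: Step) -> dict[str, Any]:
    """Hypothetical video model that wants 'prompt', whole-second 'duration', and 'ratio'."""
    return {
        "prompt": step["prompt"],
        "duration": int(step["seconds"]),
        "ratio": step.get("aspect_ratio", "16:9"),
    }


def tts_adapter(step: Step) -> dict[str, Any]:
    """Hypothetical speech model that wants 'text' and an optional 'voice'."""
    return {"text": step["prompt"], "voice": step.get("voice", "default")}


# One adapter per model; adding a model means adding one entry here.
ADAPTERS: dict[str, Callable[[Step], dict[str, Any]]] = {
    "video_a": video_a_adapter,
    "tts": tts_adapter,
}


def prepare_input(step: Step) -> dict[str, Any]:
    """Look up the adapter for the step's model and build that model's payload."""
    model = str(step["model"])
    if model not in ADAPTERS:
        raise KeyError(f"no adapter registered for model {model!r}")
    return ADAPTERS[model](step)
```

For example, `prepare_input({"model": "tts", "prompt": "Hello"})` would produce a payload with `text` and `voice` keys, while the same plan format routed to `video_a` produces `prompt`, `duration`, and `ratio`.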
Assets are generated and stored
Ramvoy generates outputs step by step. Images, clips, narration, and music are saved so later steps can reuse them.
Final assembly renders the deliverable
The internal FFmpeg assembly step combines all generated assets into the final output. This is what turns separate model outputs into one complete video.
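A minimal sketch of one way such an assembly step could be driven, assuming FFmpeg's concat demuxer is used to join clips and a single narration track is laid over them. The function names and file names are illustrative, and the actual Ramvoy render settings are not documented here. Stream copy (`-c:v copy`) only works when all clips share the same codec and resolution, so a real pipeline mixing outputs from different models would likely re-encode instead.

```python
def concat_list(clip_paths: list[str]) -> str:
    """Build the text file the FFmpeg concat demuxer reads: one "file '<path>'" line per clip."""
    lines = []
    for path in clip_paths:
        # Inside single quotes, a literal quote is written as '\'' in the concat list syntax.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def ffmpeg_args(list_file: str, narration: str, output: str) -> list[str]:
    """Return an argument list that joins the listed clips and adds the narration as audio."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_file,  # input 0: concatenated clips
        "-i", narration,                                # input 1: narration audio
        "-map", "0:v:0", "-map", "1:a:0",               # video from clips, audio from narration
        "-c:v", "copy", "-c:a", "aac",
        "-shortest",                                    # stop at the shorter of the two inputs
        output,
    ]
```

The argument list can be passed to `subprocess.run(args, check=True)` after writing the output of `concat_list` to `list_file`.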
Model stack
What each model does.
Ramvoy does not rely on one model. It routes each production task to the best model for that job.
Image generation
These models create or edit still images used as source frames, character references, product visuals, thumbnails, scene anchors, or concept art.
Flux 2 Pro
Text / image → image
Creates high-quality images from prompts and optional reference images. Best for source visuals before video generation.
What it does in Ramvoy
Ramvoy can use this to create the first visual version of a scene, character, product, poster, thumbnail, or cinematic still.
Inputs
Outputs
Limits
Supports up to 8 reference images.
Cost
$0.60 per run
Flux 2 Pro Edit
Text + image → edited image
Edits existing images using natural language instructions while preserving quality and subject consistency.
What it does in Ramvoy
Ramvoy can use this when the user uploads an image and wants it improved, restyled, cleaned up, modified, or prepared for video generation.
Inputs
Outputs
Limits
Can match the input image shape or use custom aspect ratios.
Cost
$0.10 per run
Video generation
These models create short video clips from text, images, or video inputs. Ramvoy combines many clips together to create longer AI videos.
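Because each model only produces short clips, a long video has to be divided into clip-sized pieces first. The sketch below shows one possible way to split a target length into near-equal whole-second clips that fit a model's duration range, for example the 3–15 second range listed for the Kling models. The function name and the even-split strategy are assumptions for illustration.

```python
def split_duration(total_seconds: int, min_len: int, max_len: int) -> list[int]:
    """Split a target length into the fewest near-equal clips within [min_len, max_len]."""
    if not (0 < min_len <= max_len):
        raise ValueError("need 0 < min_len <= max_len")
    if total_seconds < min_len:
        raise ValueError("target is shorter than the minimum clip length")
    clip_count = -(-total_seconds // max_len)  # ceiling division: fewest clips possible
    base, remainder = divmod(total_seconds, clip_count)
    if base < min_len:
        raise ValueError("target cannot be split evenly within the allowed range")
    # The first `remainder` clips get one extra second so the lengths sum to the target.
    return [base + 1] * remainder + [base] * (clip_count - remainder)
```

Splitting evenly avoids ending on a very short leftover clip, which could fall below a model's minimum duration.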
Grok Imagine Video
Text / image / video → video + audio
Creates short videos with audio from text or images, and can also edit existing videos.
What it does in Ramvoy
Ramvoy can use this for fast audio-aware video clips, animated image scenes, or short edits to existing video inputs.
Inputs
Outputs
Limits
Generation: 1–15 seconds. Editing: the input video can be at most 8.7 seconds long, and the output matches the input's duration, aspect ratio, and resolution.
Cost
$0.10 per second
Runway Gen-4.5
Text / image → video
Premium cinematic video generation with strong prompt adherence, realism, and physical accuracy.
What it does in Ramvoy
Ramvoy can use this for hero shots, premium cinematic clips, realistic movement, and high-quality visual sequences.
Inputs
Outputs
Limits
Allowed durations: 5 or 10 seconds. Supports 720p and 1080p.
Cost
$0.24 per second
Kling Video 3 Standard
Text / image → video
720p cinematic video generation for short story-driven clips and social media scenes.
What it does in Ramvoy
Ramvoy can use this when it needs affordable cinematic clips with multi-shot style planning.
Inputs
Outputs
Limits
3–15 seconds. 720p. Aspect ratios: 16:9, 9:16, 1:1.
Cost
$0.32 per second
Kling Video 3 Standard Audio
Text / image → video + audio
720p video generation with native audio support, including dialogue, sound effects, and ambience.
What it does in Ramvoy
Ramvoy can use this when a scene needs native sound, dialogue-like audio, ambience, or stronger audio-video sync.
Inputs
Outputs
Limits
3–15 seconds. 720p. Aspect ratios: 16:9, 9:16, 1:1.
Cost
$0.50 per second
Kling Video 3 Pro
Text / image → video
1080p cinematic video generation with improved fidelity and consistency.
What it does in Ramvoy
Ramvoy can use this for higher-quality visual scenes when audio is handled separately by TTS or music models.
Inputs
Outputs
Limits
3–15 seconds. 1080p. Aspect ratios: 16:9, 9:16, 1:1.
Cost
$0.50 per second
Kling Video 3 Pro Audio
Text / image → video + audio
1080p cinematic video generation with native audio, sound effects, ambience, and dialogue-style support.
What it does in Ramvoy
Ramvoy can use this for polished short-form clips where both visual quality and built-in audio matter.
Inputs
Outputs
Limits
3–15 seconds. 1080p. Aspect ratios: 16:9, 9:16, 1:1.
Cost
$0.60 per second
PixVerse v5.6
Text / image → video + audio
Cost-effective video generation with native audio-style output, camera movement, dialogue-style scenes, and multi-shot support.
What it does in Ramvoy
Ramvoy can use PixVerse when it needs a balance of cost, speed, audio-aware generation, and cinematic camera motion.
Inputs
Outputs
Limits
1–15 seconds. Available variants include 360p, 540p, 720p, and 1080p.
Cost
$0.14–$0.30 per second depending on quality
Wan 2.2 T2V Fast
Text → video
Fast, low-cost text-to-video model for efficient clip generation.
What it does in Ramvoy
Ramvoy can use this for cheaper utility clips, background shots, quick tests, or lower-cost video generation.
Inputs
Outputs
Limits
Available in 480p and 720p.
Cost
$0.10–$0.20 per run
Seedance 1.0 Pro
Text / image → video
1080p text-to-video and image-to-video model with strong narrative, cinematic, and multi-shot capabilities.
What it does in Ramvoy
Ramvoy can use this for complex narrative scenes, cinematic action, documentary-style shots, and consistent visual storytelling.
Inputs
Outputs
Limits
1080p. Supports multiple aspect ratios.
Cost
$0.30 per second
Ray Flash 2
Text → video
720p video generation focused on realistic motion, physics, environments, and cinematic composition.
What it does in Ramvoy
Ramvoy can use this for environmental motion, realistic physical scenes, macro shots, natural effects, and dynamic movement.
Inputs
Outputs
Limits
5-second clips at 720p.
Cost
$0.12 per second
Fabric 1.0
Image + audio → talking-head video
Creates a speaking presenter or avatar video from a source image and an audio track.
What it does in Ramvoy
Ramvoy can use this when it needs a face, avatar, influencer, or presenter to speak using generated narration.
Inputs
Outputs
Limits
1–60 seconds. Available in 480p and 720p.
Cost
$0.16–$0.30 per second depending on resolution
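The duration limits listed for the video models above can be collected into one lookup, which is one way a planner or adapter might check a requested clip length before making a call. This is a sketch that assumes whole-second durations. The table and function names are illustrative, and Wan 2.2 T2V Fast and Seedance 1.0 Pro are left out because no duration limit is listed for them.

```python
# Allowed clip lengths in whole seconds, taken from the limits listed above.
ALLOWED_SECONDS = {
    "Grok Imagine Video": range(1, 16),          # generation only: 1-15 s
    "Runway Gen-4.5": (5, 10),                   # exactly 5 or 10 s
    "Kling Video 3 Standard": range(3, 16),
    "Kling Video 3 Standard Audio": range(3, 16),
    "Kling Video 3 Pro": range(3, 16),
    "Kling Video 3 Pro Audio": range(3, 16),
    "PixVerse v5.6": range(1, 16),
    "Ray Flash 2": (5,),                         # 5-second clips only
    "Fabric 1.0": range(1, 61),
}


def is_valid_duration(model: str, seconds: int) -> bool:
    """Return True if the model accepts a clip of this length."""
    if model not in ALLOWED_SECONDS:
        raise KeyError(f"no duration limits recorded for {model!r}")
    return seconds in ALLOWED_SECONDS[model]


def models_supporting(seconds: int) -> list[str]:
    """List every model in the table that accepts a clip of this length."""
    return [name for name, allowed in ALLOWED_SECONDS.items() if seconds in allowed]
```

For instance, a 7-second clip is valid for the Kling models but not for Runway Gen-4.5, which accepts only 5 or 10 seconds.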
Speech and music
These models generate voiceovers, narration, songs, instrumentals, soundtracks, and longer-form audio that can be mixed into the final video.
TTS 1.5 Max
Text → speech audio
Creates high-quality narration, dialogue, expressive speech, multilingual voiceovers, and SSML-based delivery.
What it does in Ramvoy
Ramvoy can use this to generate narration for documentaries, explainer videos, trailers, movies, and presenter workflows.
Inputs
Outputs
Limits
Supports emotion markup, SSML, expressive delivery, and optional voice cloning.
Cost
$0.06 per 1,000 characters
Music 1.5
Text / audio → music audio
Creates accompaniment and vocal-based music from lyrics and style prompts, with optional reference music.
What it does in Ramvoy
Ramvoy can use this for music videos, songs with lyrics, branded tracks, and music-led projects.
Inputs
Outputs
Limits
1–240 seconds. Supports lyrics, vocals, and optional reference music.
Cost
$0.06 per run
Lyria 2
Text → music audio
Creates high-fidelity 48kHz stereo music across genres like classical, jazz, pop, electronic, and orchestral.
What it does in Ramvoy
Ramvoy can use this for clean background music, cinematic scoring, transitions, and soundtrack-style audio.
Inputs
Outputs
Limits
1–30 seconds.
Cost
$0.004 per second
ACE-Step
Text / audio → music audio
Flexible music model for full songs, duration control, remixing, style transfer, lyric editing, and longer creative audio.
What it does in Ramvoy
Ramvoy can use this for longer songs, remix-style workflows, lyric-aware tracks, and creative music generation.
Inputs
Outputs
Limits
1–300 seconds.
Cost
$0.06 per run
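The three music models have different maximum lengths, so the track length alone already narrows the choice. The sketch below filters the models by the limits listed above. The function name is hypothetical, and a real router would presumably also weigh style, lyrics, and cost.

```python
# Maximum track length in seconds, taken from the limits listed above.
MUSIC_MAX_SECONDS = {
    "Music 1.5": 240,
    "Lyria 2": 30,
    "ACE-Step": 300,
}


def music_models_for(seconds: int) -> list[str]:
    """Return the music models whose listed limit covers a track of this length."""
    if seconds < 1:
        raise ValueError("track length must be at least 1 second")
    return [name for name, limit in MUSIC_MAX_SECONDS.items() if seconds <= limit]
```

An empty result, as for a 400-second track, means no single call can produce the track, so it would have to be generated in parts or looped during assembly.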
Assembly
After images, clips, narration, and music are generated, Ramvoy uses an internal rendering system to combine everything into the final output.
Internal FFmpeg Assembly
Video / audio / image / captions → final video
Internal rendering step that stitches clips, loops images, mixes narration and music, adds timing, and exports the final video.
What it does in Ramvoy
Ramvoy uses this as the final production stage. It turns many separate AI-generated assets into one complete video file.
Inputs
Outputs
Limits
Supports 480p, 720p, and 1080p. Supports long-form assembly.
Cost
$0 internal cost
Architecture
Planner first, adapters second, render last.
This is the core architecture. The planner decides what should happen, adapters make every model call valid, execution generates the assets, and assembly creates the final video.
Planner
Structured execution plan
Converts the prompt into a production plan with scenes, assets, model choices, timing, and assembly rules.
Adapters
Model-specific inputs
Convert each planner step into the exact input format expected by Grok, Runway, Kling, Flux, TTS, music models, or FFmpeg.
Execution
Assets are generated
Runs each model step and stores generated images, video clips, narration, music, and other outputs.
Assembly
FFmpeg final render
Stitches clips, loops stills, mixes audio, adds timing, and exports the final video deliverable.
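The four stages can be pictured as one loop with the planner in front and assembly at the end. The sketch below passes each stage in as a function so the control flow stays visible. The signatures, the `id` and `model` step keys, and the use of strings to stand in for stored assets are assumptions for illustration only.

```python
from typing import Any, Callable

Step = dict[str, Any]


def run_pipeline(
    request: str,
    plan: Callable[[str], list[Step]],
    adapt: Callable[[Step], dict[str, Any]],
    execute: Callable[[str, dict[str, Any]], str],
    assemble: Callable[[dict[str, str]], str],
) -> str:
    """Planner first, adapters second, execution third, render last."""
    assets: dict[str, str] = {}
    for step in plan(request):                               # 1. planner output
        payload = adapt(step)                                # 2. model-specific input
        assets[step["id"]] = execute(step["model"], payload)  # 3. generate and store
    return assemble(assets)                                  # 4. final render
```

Keeping the stages separate means a new model only needs a new adapter and executor entry, while planning and assembly stay unchanged.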
Example flow
A typical run
A single Ramvoy run can call multiple models. This example shows how a long AI video can be built from separate specialized steps.
Step 1
Generate or edit visuals
Flux 2 Pro creates source images, characters, or scene anchors. Flux 2 Pro Edit modifies uploaded or existing images.
Step 2
Generate narration
TTS 1.5 Max creates voiceover, dialogue, or presenter speech.
Step 3
Generate video clips
Runway, Kling, PixVerse, Seedance, Grok, Ray, Wan, or Fabric creates motion clips.
Step 4
Generate music
Music 1.5, Lyria 2, or ACE-Step creates a soundtrack, songs, or background music.
Step 5
Assemble the final video
Internal FFmpeg combines visuals, clips, narration, music, and timing into one final output.
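To see how the listed prices add up across a run like this, the sketch below estimates the cost of one hypothetical combination: Flux 2 Pro images, Runway Gen-4.5 clips, TTS 1.5 Max narration, and Lyria 2 music, with assembly at $0. The prices come from the model cards above. The function name, the choice of models, and the assumption that narration is billed pro rata per character are illustrative.

```python
from decimal import Decimal

# Prices copied from the model cards above.
FLUX_2_PRO_PER_RUN = Decimal("0.60")
RUNWAY_PER_SECOND = Decimal("0.24")
TTS_PER_1000_CHARS = Decimal("0.06")
LYRIA_2_PER_SECOND = Decimal("0.004")


def estimate_run_cost(images: int, clip_seconds: int, narration_chars: int, music_seconds: int) -> Decimal:
    """Sum the per-model costs for one run; internal FFmpeg assembly adds nothing."""
    image_cost = FLUX_2_PRO_PER_RUN * images
    video_cost = RUNWAY_PER_SECOND * clip_seconds
    # Assumes pro-rata billing; the page only states the price per 1,000 characters.
    narration_cost = TTS_PER_1000_CHARS * narration_chars / Decimal(1000)
    music_cost = LYRIA_2_PER_SECOND * music_seconds
    return image_cost + video_cost + narration_cost + music_cost
```

In an estimate of this shape, per-second video pricing dominates: 60 seconds of Runway Gen-4.5 footage costs far more than the images, narration, and music combined.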