The clean-source protocol, for image-to-video.
Here's the single biggest reason people's Seedance and Pippit clips come back broken: they upload the planning image. The one with the arrows and the labels and the little character turnaround in the corner. The model can't tell your director's notes from the scene — so it renders them. You get glowing Chinese characters floating in the sky, your guide-arrows turned into solid metal, your character standing in an A-pose spinning in midair. This guide fixes that for good, with a workflow that's been blown up and rebuilt enough times to be boring and reliable.
Adapted from a battle-tested Mandarin Pippit/Seedance protocol, rewritten for the prompt box. The bilingual template below keeps the Chinese keys — they carry conditioning weight you don't want to translate away.
The model renders your notes
Anything in the image that can't be read as light or texture gets built as an object. That's the whole rule. Text, arrows, and storyboard frame-lines become floating glowing letters, metal arrows, and literal borders inside the shot. Character three-views and zoomed-in detail thumbnails get collaged into one space — so you get extra arms, doubled people, a tiny picture-in-picture floating in the frame. Color swatches and material balls turn into glowing orbs and abstract sculptures.
There is exactly one fix, and no prompt can substitute for it: don't let the model see them. Annotations live only in the planning stage. If you're forced to reuse the same image as your source, send it through Midjourney or Stable Diffusion first to repaint it clean — strip every word and UI mark, keep the composition and light.
Two stages, never one
The control board guides the clean frame. The clean frame makes the video. Skip the middle and you gamble.
Stage A — the control board (internal only)
This holds all your director information: composition zones, shot-size labels, character positions, motion arrows, atmosphere notes. It is for your eyes and your image tool only. It NEVER enters the video model.
Stage B — the clean cinematic source (what Pippit gets)
A single cinematic still, generated from the control board, with zero text, arrows, or UI. This frame decides every visual thing about your video. It is the only image the model is allowed to see.
Use the board to make a clean source, then use the clean source to make the video. This is more than ten times more stable than any amount of prompt-wrangling that tries to convince the model to ignore the text it can plainly see.
Source-frame rules
If the pipeline only allows one upload, that one image must be the finished clean source, and it must obey:
- Main subject ≥ 40% of the frame, set in the lower-mid golden-ratio zone. Too small and the model loses the subject and renders "ants in a wide shot."
- No-go zones: no text, icons, swatches, crosshairs, or crop frames in any corner. Labels go on the filename or in the text prompt, never in the pixels.
- Shoot native 9:16 with the subject whole. Don't rely on the model to crop, because cropping makes it drift and push the lens around.
- Never include multi-panel comics, expression sheets, or turnarounds; the model tweens them into one deformed nightmare.
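The measurable parts of these rules can be checked before you upload. What follows is a minimal sketch, not part of the original protocol. It assumes you already know the subject's bounding box in pixels, reads "≥ 40% of the frame" as frame area, and uses "subject centre at or below the midline" as a rough stand-in for the lower-mid zone. The names `Box` and `check_source_frame` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Subject bounding box in pixels (assumed to be measured by hand or by a detector)."""
    left: int
    top: int
    right: int
    bottom: int


def check_source_frame(width, height, subject, min_share=0.40, ratio_tol=0.01):
    """Return a list of rule violations; an empty list means the frame passes."""
    problems = []

    # Rule: native 9:16 vertical frame.
    if abs(width / height - 9 / 16) > ratio_tol:
        problems.append("not 9:16")

    # Rule: subject whole, i.e. its box stays inside the frame.
    if (subject.left < 0 or subject.top < 0
            or subject.right > width or subject.bottom > height):
        problems.append("subject cropped")

    # Rule: subject covers at least 40% of the frame (assumption: measured as area).
    box_area = (subject.right - subject.left) * (subject.bottom - subject.top)
    if box_area / (width * height) < min_share:
        problems.append("subject too small")

    # Rule: lower-mid placement (assumption: centre at or below the midline).
    if (subject.top + subject.bottom) / 2 < height / 2:
        problems.append("subject too high")

    return problems


print(check_source_frame(1080, 1920, Box(140, 500, 940, 1800)))
```

This cannot detect text, arrows, or UI marks in the pixels. That check still has to be done by eye.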
The five-layer depth stack
Epic scale isn't one big object — it's depth in motion. The source frame and the prompt have to agree on five layers, each with its own job and its own movement:
| LAYER | CONTENT | MOTION CUE |
|---|---|---|
| Foreground 10% | Sand, smoke, grit, embers | Streaks past fast, motion blur |
| Near-mid 20% | Marching troops, mech feet, banners | One steady direction, constant pace |
| Midground 40% | The hero — building, colossus, ship | Slow advance, high detail, clear |
| Background 20% | Skyline, mountains, smoke columns | Aerial haze, pale blue-grey, drifting |
| Sky 10% | Cloud, light beams, storm top | Tyndall rays, dramatic roll |
- Light: rim your giant with side- or top-backlight, and paint that rim into the source frame rather than hoping the model adds it.
- Smoke and dust: layer it, with fast motes up close and static columns far off.
- Crowds: the ant rule. Small, low-detail, one shared direction.
- Machines: add self-illumination or reflections at key points so the model reads the material.
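One way to keep the source frame and the prompt in agreement is to hold the five layers as data and generate the prompt lines from it. The sketch below copies the frame shares and motion cues from the table; the names `DEPTH_STACK` and `layer_lines` and the output line format are assumptions made for illustration.

```python
# (layer name, share of frame in percent, motion cue), copied from the depth-stack table.
DEPTH_STACK = [
    ("Foreground", 10, "streaks past fast, motion blur"),
    ("Near-mid", 20, "one steady direction, constant pace"),
    ("Midground", 40, "slow advance, high detail, clear"),
    ("Background", 20, "aerial haze, pale blue-grey, drifting"),
    ("Sky", 10, "Tyndall rays, dramatic roll"),
]


def layer_lines(content, stack=DEPTH_STACK):
    """Build one prompt line per layer from a {layer name: content} mapping."""
    # The shares should account for the whole frame.
    total = sum(share for _, share, _ in stack)
    if total != 100:
        raise ValueError(f"layer shares sum to {total}, expected 100")

    # Every layer needs content, otherwise the frame and the prompt disagree.
    missing = [name for name, _, _ in stack if name not in content]
    if missing:
        raise ValueError(f"no content for: {', '.join(missing)}")

    return [f"- {name}: {content[name]}, {cue}" for name, _, cue in stack]


scene = {
    "Foreground": "sand and embers",
    "Near-mid": "marching troops and banners",
    "Midground": "bronze colossus",
    "Background": "skyline and smoke columns",
    "Sky": "storm cloud and light beams",
}
print("\n".join(layer_lines(scene)))
```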
Three camera moves that hold for 10–15s
One continuous shot, no cuts. Put the move at the very front of the prompt and assign it time. Three structures survive the duration:
1. God's-eye descent. 0–3s drop through cloud, 3–8s punch out into a city reveal, 8–15s slow push past a midground colossus toward the army and storm behind it.
2. Low-angle follow (mortal's view). 0–5s ground-level up-look, a mech foot slams and cracks the earth; 5–10s tilt up and pull back to reveal the whole body and the legion behind it; 10–15s a slow lateral drift past the burning city through the joints.
3. Lateral reveal (the scroll). 0–4s skim a wall or ridge left-to-right, details flicking past; 4–10s pull back while moving to open the full army and smoke; 10–15s settle on the wide — empire in the sandstorm, a tiny sun.
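The timing in these structures is easy to get wrong by hand, so here is a small sketch that checks the beats before they go into the prompt. It assumes beats are `(start, end, description)` tuples in whole seconds, that they must start at 0 and run back to back, and that the total must land in the 10–15s window. `render_move` is a hypothetical helper, and its output wording is only one possible phrasing.

```python
def render_move(name, beats, min_total=10, max_total=15):
    """Validate timed beats and render them as a move-first prompt fragment."""
    if not beats or beats[0][0] != 0:
        raise ValueError("first beat must start at 0s")

    # Each beat needs a positive duration.
    for start, end, _ in beats:
        if end <= start:
            raise ValueError(f"beat {start}-{end}s has no duration")

    # One continuous shot: every beat starts where the previous one ended.
    for (_, prev_end, _), (start, _, _) in zip(beats, beats[1:]):
        if start != prev_end:
            raise ValueError(f"gap or overlap between {prev_end}s and {start}s")

    total = beats[-1][1]
    if not min_total <= total <= max_total:
        raise ValueError(f"total {total}s is outside {min_total}-{max_total}s")

    parts = [f"{start}-{end}s {text}" for start, end, text in beats]
    return f"{name}, one continuous shot, no cuts: " + ", ".join(parts)


descent = [
    (0, 3, "drop through cloud"),
    (3, 8, "punch out into a city reveal"),
    (8, 15, "slow push past the colossus"),
]
print(render_move("God's-eye descent", descent))
```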
The bilingual prompt blueprint
Author in English, but keep the Chinese keys — on a model trained heavily on Chinese captions, they pull weight that a pure-English translation loses. Fill the braces, keep it one continuous shot, stay well under the character limit.
[镜头运动与节奏 / Camera move + timing]: {one continuous shot, NO cuts. e.g. descend through cloud at 0-3s, punch through, then slow push around the city for 12s, no cut}
[环境与氛围 / Environment + mood]: {vast ancient megacity in a red sandstorm, countless smoke columns linking earth to sky, grim epic atmosphere}
[主体与多层运动 / Subject + layered motion]:
- 前景 / Foreground: {dust, ash, embers streak past fast, motion blur}
- 中景 / Midground: {marching bronze colossus legions, heavy footfalls, armor catching light}
- 背景 / Background: {endless spires and floating fortresses, fading in the sand haze}
- 天空 / Sky: {churning ochre sand-cloud, golden god-rays through the cracks, beams drifting slowly}
[光线与特效 / Light + FX]: side-backlight rims the colossus in gold, armor glows faint red sigils, dust motes sparkle in the beams.
[画质要求 / Quality]: cinematic, ARRI ALEXA 65, ultra-wide, vertical 9:16, layered depth of field, high detail, 4K.
Failure diagnosis
| SYMPTOM | FIX THE IMAGE | FIX THE PROMPT |
|---|---|---|
| Too small, not epic | Push subject to midground, force ≥40%, wide-angle in close | “ultra-wide, close, exaggerated perspective, subject fills frame” |
| Subject deforms, multi-arm | One complete character only, no turnarounds | “keep exact original form, no deformation, single entity” |
| Twitchy, conflicting motion | Make the still's elements share one direction | Simplify to 1–2 global directions: “all elements drift right” |
| Text / arrow / UI residue | Repaint clean in an AI image tool or Photoshop, removing every mark | No prompt fixes this. The source must be clean. |
| Random cuts, axis jumps | No image change needed | One continuous shot only; use “continues / meanwhile,” never “then cut” |
| Mushy, oil-painting look | Raise source to 1920×3360+, keep it sharp | “extreme detail, sharp focus, no blur, crisp texture” |
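Two of the prompt-side checks can be run mechanically before generating: cut language and length. The sketch below is only an illustration. The phrase list is an assumption (the table names only "then cut"), and `max_chars` must come from the caller because this guide does not state the model's character limit. `lint_prompt` is a hypothetical name.

```python
# Phrases that ask for an edit instead of one continuous shot.
# Assumption: "then cut" is from the table, "cut to" is added as a likely variant.
CUT_PHRASES = ("then cut", "cut to")


def lint_prompt(prompt, max_chars):
    """Return a list of warnings for a video prompt; an empty list means no findings."""
    warnings = []
    lowered = prompt.lower()

    # Flag wording that invites random cuts and axis jumps.
    for phrase in CUT_PHRASES:
        if phrase in lowered:
            warnings.append(f"cut language: '{phrase}'")

    # The limit is supplied by the caller; it is not specified in this guide.
    if len(prompt) > max_chars:
        warnings.append(f"too long: {len(prompt)} > {max_chars} characters")

    return warnings


print(lint_prompt("Push in on the colossus, then cut to the army", max_chars=500))
```

A clean result only means these two checks passed. Residue in the source image is invisible to any prompt check.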
The whole SOP in one breath: build the annotated board (eyes only) → generate a clean 9:16 source from it in Midjourney/SD with the board as a 0.6–0.8 reference and "no text, no arrows, no UI" in the prompt → grade and sharpen to 1080×1920+ → write the move-first bilingual prompt → generate in Pippit at low-to-mid motion → review for text residue and deformation first, scale and stability second. Every shortcut costs more time than it saves. Do it once, properly.