I just edited a 4K YouTube video — audio cleaned and leveled, clips merged with transitions, captions timed to my own words, a thumbnail rendered — and I never opened a video editor. I talked to an AI instead.
Here's the why first. I'm building toward $5K/month in passive income by 2033, and one of the pillars I'm betting on is this YouTube channel — which means I actually have to produce videos. Around a full-time job. On limited hours. So I built a system that takes me from raw footage to a finished, captioned, 4K video almost entirely through conversation.
Here's how it works.
The Workflow at a Glance
- Shoot the clips on my phone
- Normalize the audio to YouTube standards (automatic)
- Merge the clips with transitions in Remotion
- Transcribe the audio with Whisper
- Use the transcript to place accent text at exact moments
- Build the thumbnail
- Render
The entire thing (minus the shooting) happens inside a single Remotion project that I can edit just by talking to Claude. Remotion is free for individuals and small teams, and Claude puts it to work for me.
Step 1: Audio Normalization (Automatic)
YouTube normalizes playback loudness to about -14 LUFS. If your video is louder than that, YouTube turns it down on its end. If it's much quieter, it just plays back quiet. Either way, your audio sounds off next to everything else.
Rather than deal with that manually, I have a script that runs every clip through ffmpeg with a filter chain:
- A high-pass filter at 120 Hz to kill wind rumble from recording outdoors
- An FFT denoiser to clean up broadband hiss
- A loudnorm filter targeting -14 LUFS / -1 dBTP
All four clips process in one shot. The video stream is copied untouched — only the audio gets re-encoded. So I keep full 4K quality while the audio gets cleaned and leveled to spec automatically.
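A minimal sketch of what such a script could look like, assuming Python drives ffmpeg. The three filters and the loudness targets come from the list above. The function names, the AAC codec, the 192k bitrate, and the 48 kHz output rate are assumptions for illustration, not the actual script.

```python
import subprocess
from pathlib import Path

# Filter chain from the list above. Denoiser defaults are an assumption.
AUDIO_FILTERS = ",".join([
    "highpass=f=120",        # cut wind rumble below 120 Hz
    "afftdn",                # FFT-based broadband denoiser
    "loudnorm=I=-14:TP=-1",  # -14 LUFS integrated, -1 dBTP true peak
])


def build_ffmpeg_command(src: Path, dst: Path) -> list[str]:
    """Return the ffmpeg argument list for one clip (nothing is executed here)."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-c:v", "copy",                 # video stream copied untouched
        "-af", AUDIO_FILTERS,           # audio is filtered, so it must be re-encoded
        "-c:a", "aac", "-b:a", "192k",  # assumed codec and bitrate
        "-ar", "48000",                 # loudnorm can change the sample rate; pin it (assumed 48 kHz)
        str(dst),
    ]


def normalize_all(clips: list[Path], out_dir: Path) -> None:
    """Run every clip through the same chain, one after another."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for clip in clips:
        subprocess.run(build_ffmpeg_command(clip, out_dir / clip.name), check=True)
```

Calling `normalize_all([Path("clip1.mp4"), Path("clip2.mp4")], Path("normalized"))` would process each file in turn. The file names here are placeholders.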
Step 2: Merging Clips with Transitions
Remotion is a tool that lets you build videos in React. Instead of dragging clips around in a timeline, you write a composition in code. The clips are joined with 0.5-second crossfade transitions using Remotion's TransitionSeries.
What makes this powerful is the live Studio preview. I open it in my browser and scrub through the full merged video in real time. If I want to trim the beginning off a clip, I change one number — and the preview updates immediately. No re-exporting. No waiting.
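The arithmetic behind a crossfaded timeline is worth seeing once, because it explains why one trim shifts everything after it. This is a sketch in Python rather than the React composition itself. The 30 fps frame rate and the clip lengths are assumptions; the 0.5-second crossfade is from the post.

```python
FPS = 30                # assumed frame rate
TRANSITION_FRAMES = 15  # 0.5 s crossfade at the assumed 30 fps


def total_frames(clip_frames: list[int], transition: int = TRANSITION_FRAMES) -> int:
    """Length of the merged timeline: each crossfade overlaps two neighbours."""
    if not clip_frames:
        return 0
    return sum(clip_frames) - transition * (len(clip_frames) - 1)


def trimmed(clip_frames: list[int], index: int, trim_start: int) -> list[int]:
    """Clip lengths after cutting trim_start frames off the front of one clip."""
    out = list(clip_frames)
    out[index] -= trim_start
    return out


clips = [3000, 2400, 2700, 1800]            # hypothetical clip lengths in frames
print(total_frames(clips))                  # 9855
print(total_frames(trimmed(clips, 1, 60)))  # 9795
```

Four clips have three crossfades, so the merged video is 45 frames shorter than the clips laid end to end. Trimming two seconds off the second clip pulls every later clip forward by 60 frames.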
Step 3: The Transcript System — Read, Then Place
This is the part I'm most excited about, because it creates a feedback loop between what I said and what appears on screen.
After the clips are normalized, I run them through Whisper (medium.en) — OpenAI's speech-to-text model. This produces a timestamped subtitle file for each clip.
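One way to get from Whisper to a subtitle file, sketched with the `openai-whisper` Python package. The post does not say whether the CLI or the Python API was used, so treat this as an assumption. `load_model` and `transcribe` are the package's documented calls; the helper names are made up for this example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn Whisper-style segments (start, end, text) into SRT text."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        span = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{number}\n{span}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(media_path: str) -> str:
    """Transcribe one clip with medium.en and return its SRT text."""
    import whisper  # imported lazily so the helpers above work without the package

    model = whisper.load_model("medium.en")
    result = model.transcribe(media_path)
    return segments_to_srt(result["segments"])
```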
A Python script then assembles those four files into a single markdown transcript with final-timeline frame numbers, accounting for the crossfade overlaps so the timestamps reflect the actual merged video, not the individual clips. A few lines from it:
- [2:47] (f5019) The first part of this is trading.
- [3:58] (f7149) The second one is this YouTube channel.
- [5:24] (f9736) The third vehicle is going to be a business...
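The conversion from a per-clip subtitle timestamp to a final-timeline frame can be sketched like this. It is not the actual script: the 30 fps rate, the clip lengths, and the sample timestamp are assumptions chosen so the output matches the first line above.

```python
import re

FPS = 30                # assumed frame rate
TRANSITION_FRAMES = 15  # 0.5 s crossfade at the assumed 30 fps

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d{3})")


def srt_seconds(stamp: str) -> float:
    """Parse an SRT timestamp such as 00:01:07,800 into seconds."""
    hours, minutes, secs, ms = SRT_TIME.fullmatch(stamp.strip()).groups()
    return int(hours) * 3600 + int(minutes) * 60 + int(secs) + int(ms) / 1000


def clip_start_frames(clip_frames: list[int], transition: int = TRANSITION_FRAMES) -> list[int]:
    """Frame where each clip begins on the merged timeline."""
    starts, cursor = [], 0
    for length in clip_frames:
        starts.append(cursor)
        cursor += length - transition  # the next clip starts inside the crossfade
    return starts


def transcript_line(clip_index: int, stamp: str, text: str, starts: list[int], fps: int = FPS) -> str:
    """Format one markdown transcript line with a final-timeline frame number."""
    frame = starts[clip_index] + round(srt_seconds(stamp) * fps)
    minutes, seconds = divmod(frame // fps, 60)
    return f"- [{minutes}:{seconds:02d}] (f{frame}) {text}"


starts = clip_start_frames([3000, 2400, 2700, 1800])  # hypothetical clip lengths
print(starts)  # [0, 2985, 5370, 8055]
print(transcript_line(1, "00:01:07,800", "The first part of this is trading.", starts))
# - [2:47] (f5019) The first part of this is trading.
```

Each clip after the first starts one transition length earlier than it would in a hard cut, which is the overlap the frame numbers have to absorb.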
Those frame numbers map directly to text overlay entries in the composition. When I want an accent line at the exact moment I say "The first part of this is trading," I look up the frame and add one line to the code:
{ text: "Pillar 1 — Trading", style: "fact", startFrame: 5019, durationInFrames: 150 }
The overlay appears in the Studio preview instantly. I ended up with eight accent lines across this video, each locked to a spoken beat:
| Time | Text | Style |
|---|---|---|
| 0:03 | Passive Income | Orange |
| 1:03 | Freedom Number | White |
| 2:10 | 50 / 30 / 20 | White |
| 2:48 | Pillar 1 — Trading | White |
| 3:58 | Pillar 2 — YouTube | White |
| 4:44 | Neo-pioneer.com | Orange |
| 5:24 | Pillar 3 — A Business | White |
| 5:54 | Goal: Systems by 2033 | Orange |
Reading the transcript, deciding where to put text, and having Claude write the entries took about 10 minutes.
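The lookup itself is mechanical enough to sketch. The version below searches the markdown transcript for a spoken phrase and builds an overlay entry with the same four fields as the one shown earlier. The function names are hypothetical, and the real entries live in the React composition, not in Python.

```python
import json
import re

# Matches transcript lines such as: - [2:47] (f5019) The first part of this is trading.
LINE = re.compile(r"- \[(\d+):(\d{2})\] \(f(\d+)\) (.*)")

TRANSCRIPT = """\
- [2:47] (f5019) The first part of this is trading.
- [3:58] (f7149) The second one is this YouTube channel.
- [5:24] (f9736) The third vehicle is going to be a business...
"""


def find_frame(transcript: str, phrase: str) -> int:
    """Return the frame number of the first line containing the phrase."""
    for raw in transcript.splitlines():
        match = LINE.match(raw.strip())
        if match and phrase.lower() in match.group(4).lower():
            return int(match.group(3))
    raise LookupError(f"phrase not found in transcript: {phrase!r}")


def overlay_entry(transcript: str, phrase: str, text: str,
                  style: str = "fact", duration: int = 150) -> dict:
    """Build one overlay entry locked to the moment the phrase is spoken."""
    return {
        "text": text,
        "style": style,
        "startFrame": find_frame(transcript, phrase),
        "durationInFrames": duration,
    }


entry = overlay_entry(TRANSCRIPT, "first part of this is trading", "Pillar 1 — Trading")
print(json.dumps(entry, ensure_ascii=False))
```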
Step 4: I Shot Vertical. On Purpose (Sort Of).
My older videos were shot landscape — standard horizontal framing. This batch I shot on my phone, holding it upright. Vertical footage dropped into a 3840×2160 landscape frame gives you a tiny picture with black bars on both sides, which looks bad.
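To put a number on "tiny", here is the fit arithmetic, assuming the phone footage is 2160×3840 (9:16). That source resolution is an assumption; the 3840×2160 frame is from the post.

```python
def fit_inside(src_w: int, src_h: int, frame_w: int, frame_h: int) -> tuple[int, int]:
    """Scale the source to fit entirely inside the frame, keeping aspect ratio."""
    scale = min(frame_w / src_w, frame_h / src_h)
    return round(src_w * scale), round(src_h * scale)


FRAME_W, FRAME_H = 3840, 2160
width, height = fit_inside(2160, 3840, FRAME_W, FRAME_H)  # assumed 9:16 source
bar = (FRAME_W - width) / 2
share = round(width / FRAME_W * 100, 1)

print(width, height)  # 1215 2160
print(bar)            # 1312.5
print(share)          # 31.6
```

Under that assumption the footage covers under a third of the frame width, with a bar on each side that is wider than the picture itself.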
I had three options:
- Black bars: do nothing, accept the pillarboxing
- Vertical 4K: switch the whole composition to portrait
- Landscape + blurred fill: center the vertical footage and fill the sides with a blurred version of the same frame
I went with the blurred fill. And then Claude suggested making the fill animated — a slow hue-cycling wash with oversaturated color, a breathing scale, and a gentle rotation drift. It matches the tie-dye shirt energy and gives the video subtle life without competing with what's happening in the center.
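The motion in that fill comes down to three slow functions of the frame number. This is a Python sketch of the math only, since the real version is a React component. Every constant here (cycle speed, periods, amplitudes, blur radius, saturation) is a made-up value for illustration.

```python
import math

FPS = 30  # assumed frame rate


def fill_style(frame: int, fps: int = FPS) -> dict:
    """Per-frame parameters for the animated background fill."""
    t = frame / fps  # seconds since the start of the video
    return {
        "hue_deg": (t * 6.0) % 360,                                  # full hue cycle every 60 s
        "scale": 1.15 + 0.05 * math.sin(2 * math.pi * t / 8),        # breathing, 8 s period
        "rotation_deg": 1.5 * math.sin(2 * math.pi * t / 20),        # gentle drift, 20 s period
    }


def css_filter(style: dict, blur_px: int = 60, saturate: float = 1.8) -> str:
    """CSS filter string: blur, oversaturate, then rotate the hue."""
    return f"blur({blur_px}px) saturate({saturate}) hue-rotate({style['hue_deg']:.1f}deg)"


def css_transform(style: dict) -> str:
    """CSS transform string for the breathing scale and rotation drift."""
    return f"scale({style['scale']:.3f}) rotate({style['rotation_deg']:.2f}deg)"


start = fill_style(0)
print(css_filter(start))     # blur(60px) saturate(1.8) hue-rotate(0.0deg)
print(css_transform(start))  # scale(1.150) rotate(0.00deg)
```

Keeping the scale above 1 at all times matters: with a rotation applied, a fill scaled to exactly 100% would expose empty corners.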
Step 5: The Thumbnail
The thumbnail is a separate Remotion composition (1280×720, one frame). Same project, same Studio — no Photoshop, no Canva. Export is one command.
The layout:
- A still frame from the footage on the left (~60% of the width)
- A dark sidebar on the right with an orange left border
- Stacked Oswald Bold text: $5K / MONTH → Passive Income → The Plan
We ran an A/B comparison of two stills: variant A used a more energetic expression from clip 3, variant B used a calmer direct look from clip 4. The main challenge was framing vertical source footage into a horizontal crop without cutting off my chin. We solved it by constraining the photo zone width and adjusting the vertical crop position with a single CSS value.
I picked A. More stopping power for the hook.
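For the chin problem, it helps to see which rows of the source frame survive the crop. The sketch below assumes the single CSS value is the vertical `object-position` on an image with `object-fit: cover`, a 2160×3840 source, and a 768 px photo zone (60% of 1280). None of those specifics are stated in the post.

```python
def visible_source_rows(src_w: int, src_h: int, zone_w: int, zone_h: int,
                        position_y: float = 0.5) -> tuple[float, float]:
    """Rows of the source image left visible by object-fit: cover.

    position_y is the vertical object-position as a fraction:
    0.0 keeps the top of the image, 1.0 keeps the bottom.
    """
    scale = max(zone_w / src_w, zone_h / src_h)  # cover: fill the zone completely
    overflow_y = src_h * scale - zone_h          # scaled pixels that do not fit vertically
    hidden_above = overflow_y * position_y       # share of the overflow cut from the top
    return hidden_above / scale, (hidden_above + zone_h) / scale


ZONE_W, ZONE_H = 768, 720  # assumed photo zone inside the 1280x720 thumbnail

for position in (0.5, 0.2):
    top, bottom = visible_source_rows(2160, 3840, ZONE_W, ZONE_H, position)
    print(position, round(top), round(bottom))
```

Moving the position from 0.5 to 0.2 slides the visible window up the source frame without changing its size, which is the kind of one-value adjustment that keeps a face fully in the crop.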
The Conversation-First Editing Loop
The part I want to highlight most: all of this happens through conversation. I'm not writing React code from scratch. I describe what I want — "make the sidebars kind of trippy without distracting from the video", "add an accent line at the moment I say 'freedom number'", "try the A/B with my chin fully in the shot" — and the system figures out the implementation.
The Studio preview makes everything immediately verifiable. If it looks wrong, I say what's wrong and it gets fixed. If it looks right, we move on.
Each project also gets faster because the patterns carry over. The transcript-to-overlay system I built for this video is reusable on every video from here on. That's the kind of compounding I'm looking for.
What's Next
This video is the first in a series. I'll be tracking progress toward the $5K/month goal publicly — what's working, what's not, actual numbers over time.
Want a say in where I go deeper next? Vote on the next pillar video. And if you've tried AI-assisted video editing — or you're curious where to start — I'd love to hear where you're at.
Built with: Remotion 4.0 · ffmpeg 8.1 · OpenAI Whisper (medium.en) · Claude Opus