AI & Modern Tools · Jun 3, 2026 · 7 min read

How I Edit My YouTube Videos with Claude and Remotion

From raw footage to a polished 4K video — audio normalized, transitions in, captions timed to the transcript, and a thumbnail — mostly through conversation with AI.

I just edited a 4K YouTube video — audio cleaned and leveled, clips merged with transitions, captions timed to my own words, a thumbnail rendered — and I never opened a video editor. I talked to an AI instead.

Here's the why first. I'm building toward $5K/month in passive income by 2033, and one of the pillars I'm betting on is this YouTube channel — which means I actually have to produce videos. Around a full-time job. On limited hours. So I built a system that takes me from raw footage to a finished, captioned, 4K video almost entirely through conversation.

Here's how it works.

The Workflow at a Glance

Shoot the clips on my phone
Normalize the audio to YouTube standards (automatic)
Merge the clips with transitions in Remotion
Transcribe the audio with Whisper
Use the transcript to place accent text at exact moments
Build the thumbnail
Render

The entire thing (minus the shooting) happens inside a single Remotion project that I can edit just by talking to Claude. Remotion is free for individuals and small teams, and Claude puts it to work for me.

Step 1: Audio Normalization (Automatic)

YouTube has a loudness target: -14 LUFS. If your video is louder than that, YouTube turns it down on its end; if it's much quieter, it generally stays quiet. Either way, your audio sounds off next to everyone else's.

Rather than deal with that manually, I have a script that runs every clip through ffmpeg with a filter chain:

A high-pass filter at 120 Hz to kill wind rumble from recording outdoors
An FFT denoiser to clean up broadband hiss
A loudnorm filter targeting -14 LUFS / -1 dBTP

All four clips process in one shot. The video stream is copied untouched — only the audio gets re-encoded. So I keep full 4K quality while the audio gets cleaned and leveled to spec automatically.
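A minimal sketch of how that filter chain could be assembled, assuming a small helper that builds the ffmpeg argument list per clip. The filter names (`highpass`, `afftdn`, `loudnorm`) and the `-c:v copy` flag are standard ffmpeg; the helper name, file names, AAC codec, bitrate, and sample rate are assumptions for illustration, and this uses single-pass `loudnorm` rather than the more accurate two-pass mode.

```typescript
// Hypothetical helper: builds the ffmpeg arguments for one clip.
// Codec, bitrate, sample rate, and file names are assumptions, not from the post.
function buildNormalizeArgs(input: string, output: string): string[] {
  const audioFilters = [
    "highpass=f=120",       // cut wind rumble below 120 Hz
    "afftdn",               // FFT-based denoiser with default settings
    "loudnorm=I=-14:TP=-1", // target -14 LUFS integrated, -1 dBTP true peak
  ].join(",");

  return [
    "-i", input,
    "-c:v", "copy",         // copy the video stream untouched
    "-af", audioFilters,    // only the audio goes through the filter chain
    "-c:a", "aac",          // assumed audio codec
    "-b:a", "192k",         // assumed audio bitrate
    "-ar", "48000",         // assumed output rate (loudnorm resamples internally)
    output,
  ];
}

// Hypothetical clip names; one command per clip.
const clips = ["clip1.mov", "clip2.mov", "clip3.mov", "clip4.mov"];
const commands = clips.map((clip) =>
  ["ffmpeg", ...buildNormalizeArgs(clip, clip.replace(".mov", "_norm.mov"))].join(" ")
);
console.log(commands.join("\n"));
```

Keeping the arguments as an array makes it easy to hand them to a process runner without worrying about shell quoting.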

Step 2: Merging Clips with Transitions

Remotion is a tool that lets you build videos in React. Instead of dragging clips around in a timeline, you write a composition in code. The clips are joined with 0.5-second crossfade transitions using Remotion's TransitionSeries.

What makes this powerful is the live Studio preview. I open it in my browser and scrub through the full merged video in real time. If I want to trim the beginning off a clip, I change one number — and the preview updates immediately. No re-exporting. No waiting.
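The timeline arithmetic behind those crossfades can be sketched in a few lines. In a transition series each crossfade overlaps the end of one clip with the start of the next, so the merged video is shorter than the sum of its clips. This sketch assumes a 30 fps timeline and made-up clip lengths; the function names are hypothetical.

```typescript
// Assumptions: 30 fps timeline, hypothetical clip lengths in frames.
const FPS = 30;
const TRANSITION_FRAMES = Math.round(0.5 * FPS); // 0.5 s crossfade = 15 frames

// Start frame of each clip on the merged timeline. Every clip after the
// first starts one transition earlier than it would with a hard cut.
function clipStartFrames(clipDurations: number[], transitionFrames: number): number[] {
  const starts: number[] = [];
  let cursor = 0;
  for (const duration of clipDurations) {
    starts.push(cursor);
    cursor += duration - transitionFrames;
  }
  return starts;
}

// Total length: sum of the clips minus one overlap per transition.
function totalFrames(clipDurations: number[], transitionFrames: number): number {
  const sum = clipDurations.reduce((a, b) => a + b, 0);
  const transitions = Math.max(clipDurations.length - 1, 0);
  return sum - transitions * transitionFrames;
}

const durations = [3000, 2400, 2700, 1800]; // hypothetical
console.log(clipStartFrames(durations, TRANSITION_FRAMES)); // [0, 2985, 5370, 8055]
console.log(totalFrames(durations, TRANSITION_FRAMES));     // 9855
```

Trimming a clip just changes one entry in `durations`, and every later start frame shifts with it.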

Step 3: The Transcript System — Read, Then Place

This is the part I'm most excited about, because it creates a feedback loop between what I said and what appears on screen.

After the clips are normalized, I run them through Whisper (medium.en) — OpenAI's speech-to-text model. This produces a timestamped subtitle file for each clip.

A Python script then assembles those four files into a single markdown transcript with final-timeline frame numbers — accounting for the crossfade overlaps so the timestamps reflect the actual merged video, not the individual clips.

- [2:47] (f5019)  The first part of this is trading.
- [3:58] (f7149)  The second one is this YouTube channel.
- [5:24] (f9736)  The third vehicle is going to be a business...
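The assembly step is a Python script in my setup, but the core conversion is small enough to sketch here in TypeScript to match the Remotion side. It assumes 30 fps (consistent with f5019 landing at 2:47) and a known start frame for each clip on the merged timeline; the function names, the sample timestamp, and the clip start frame are hypothetical.

```typescript
// Parse an SRT timestamp such as "00:01:07,800" into seconds.
function srtToSeconds(stamp: string): number {
  const match = /^(\d+):(\d{2}):(\d{2})[,.](\d{3})$/.exec(stamp.trim());
  if (!match) throw new Error("Bad SRT timestamp: " + stamp);
  return Number(match[1]) * 3600 + Number(match[2]) * 60 + Number(match[3]) + Number(match[4]) / 1000;
}

// Shift a clip-local timestamp onto the merged timeline.
// clipStartFrame is where the clip begins once crossfade overlaps are accounted for.
function toTimelineFrame(stamp: string, clipStartFrame: number, fps: number): number {
  return clipStartFrame + Math.round(srtToSeconds(stamp) * fps);
}

// Format one line in the "- [m:ss] (fNNNN)  text" shape used in the transcript.
function transcriptLine(frame: number, text: string, fps: number): string {
  const totalSeconds = Math.floor(frame / fps);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  const ss = seconds < 10 ? "0" + seconds : String(seconds);
  return "- [" + minutes + ":" + ss + "] (f" + frame + ")  " + text;
}

// Hypothetical input: a cue 1:07.8 into a clip that starts at frame 2985.
const frame = toTimelineFrame("00:01:07,800", 2985, 30);
console.log(transcriptLine(frame, "The first part of this is trading.", 30));
```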

Those frame numbers map directly to text overlay entries in the composition. When I want an accent line at the exact moment I say "The first part of this is trading," I look up the frame and add one line to the code:

{ text: "Pillar 1 — Trading", style: "fact", startFrame: 5019, durationInFrames: 150 }

The overlay appears in the Studio preview instantly. I ended up with eight accent lines across this video, each locked to a spoken beat:

Time	Text	Style
0:03	Passive Income	Orange
1:03	Freedom Number	White
2:10	50 / 30 / 20	White
2:48	Pillar 1 — Trading	White
3:58	Pillar 2 — YouTube	White
4:44	Neo-pioneer.com	Orange
5:24	Pillar 3 — A Business	White
5:54	Goal: Systems by 2033	Orange

Reading the transcript, deciding where to put text, and having Claude write the entries took about 10 minutes.
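For a sense of how little code an overlay entry needs, here is a rough sketch of the data shape and the visibility check. The field names come from the entry shown earlier; the type name, the `activeOverlays` helper, and treating `style` as a free-form string are assumptions. In a real Remotion composition, wrapping each entry in a `Sequence` with `from` and `durationInFrames` would likely do the same job.

```typescript
// Shape of one overlay entry. Only the "fact" style appears in the post,
// so style is left as a plain string here (assumption).
type AccentOverlay = {
  text: string;
  style: string;
  startFrame: number;
  durationInFrames: number;
};

const overlays: AccentOverlay[] = [
  { text: "Pillar 1 — Trading", style: "fact", startFrame: 5019, durationInFrames: 150 },
];

// Overlays visible at a given frame: start is inclusive, end is exclusive.
function activeOverlays(entries: AccentOverlay[], frame: number): AccentOverlay[] {
  return entries.filter(
    (o) => frame >= o.startFrame && frame < o.startFrame + o.durationInFrames
  );
}

console.log(activeOverlays(overlays, 5100).map((o) => o.text));
```

At an assumed 30 fps, 150 frames keeps the line on screen for five seconds.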

Step 4: I Shot Vertical. On Purpose (Sort Of).

My older videos were shot landscape — standard horizontal framing. This batch I shot on my phone, holding it upright. Vertical footage dropped into a 3840×2160 landscape frame gives you a tiny picture with black bars on both sides, which looks bad.
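To put numbers on "tiny": a quick fit calculation, assuming the phone footage is 2160×3840 (vertical 4K, an assumption) and using a hypothetical `containFit` helper.

```typescript
type Size = { width: number; height: number };

// Scale a source so it fits entirely inside a frame, preserving aspect ratio.
function containFit(source: Size, frame: Size): { width: number; height: number; barEachSide: number } {
  const scale = Math.min(frame.width / source.width, frame.height / source.height);
  const width = source.width * scale;
  const height = source.height * scale;
  return { width, height, barEachSide: (frame.width - width) / 2 };
}

// Assumed source size: 2160x3840 vertical footage in a 3840x2160 frame.
const fit = containFit({ width: 2160, height: 3840 }, { width: 3840, height: 2160 });
console.log(fit); // { width: 1215, height: 2160, barEachSide: 1312.5 }
```

Under that assumption the picture fills about a third of the frame width, with each bar wider than the footage itself.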

I had three options:

Black bars: do nothing, accept the pillarboxing
Vertical 4K: switch the whole composition to portrait
Landscape + blurred fill: center the vertical footage and fill the sides with a blurred version of the same frame

I went with the blurred fill. And then Claude suggested making the fill animated — a slow hue-cycling wash with oversaturated color, a breathing scale, and a gentle rotation drift. It matches the tie-dye shirt energy and gives the video subtle life without competing with what's happening in the center.
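A sketch of how a frame-driven fill like that could be expressed, assuming the background layer takes a CSS `filter` and `transform` computed from the current frame. Every constant here (blur radius, saturation, speeds, amplitudes) is a hypothetical tuning value, and the function name is made up.

```typescript
type FillStyle = { filter: string; transform: string };

// Compute the background fill style for one frame. All numbers are hypothetical.
function animatedFillStyle(frame: number, fps: number): FillStyle {
  const t = frame / fps;                                      // seconds elapsed
  const hue = (t * 12) % 360;                                 // hue cycle: 12 deg/s, one loop per 30 s
  const scale = 1.3 + 0.05 * Math.sin((2 * Math.PI * t) / 8); // breathing: +/-0.05 over an 8 s period
  const rotation = 3 * Math.sin((2 * Math.PI * t) / 20);      // drift: +/-3 deg over a 20 s period
  return {
    filter: "blur(60px) saturate(1.8) hue-rotate(" + hue.toFixed(1) + "deg)",
    transform: "scale(" + scale.toFixed(3) + ") rotate(" + rotation.toFixed(2) + "deg)",
  };
}

console.log(animatedFillStyle(0, 30));
// { filter: "blur(60px) saturate(1.8) hue-rotate(0.0deg)", transform: "scale(1.300) rotate(0.00deg)" }
```

The base scale sits above 1 so the rotated, blurred layer still covers the corners of the frame. Because the style depends only on the frame number, every render produces the same result.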

Step 5: The Thumbnail

The thumbnail is a separate Remotion composition (1280×720, one frame). Same project, same Studio — no Photoshop, no Canva. Export is one command.

The layout:

A still frame from the footage on the left (~60% of the width)
A dark sidebar on the right with an orange left border
Stacked Oswald Bold text: $5K / MONTH → Passive Income → The Plan

We ran an A/B: variant A used a more energetic expression from clip 3, variant B used a calmer direct look from clip 4. The main challenge was framing vertical source footage into a horizontal crop without cutting off my chin — solved by constraining the photo zone width and adjusting the vertical crop position with a single CSS value.
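The crop math can be sketched like this, assuming the photo zone uses `object-fit: cover` and the "single CSS value" is the vertical part of `object-position`. Both are assumptions, as are the 2160×3840 source size and the helper names.

```typescript
type Box = { width: number; height: number };

// How far a cover-scaled image overflows its box vertically, in pixels.
function coverOverflowY(source: Box, zone: Box): number {
  const scale = Math.max(zone.width / source.width, zone.height / source.height);
  return source.height * scale - zone.height;
}

// Pixels cropped off the top for a given object-position Y percentage (0 to 100).
function hiddenTopPx(overflowY: number, positionY: number): number {
  return (overflowY * positionY) / 100;
}

// Style for the photo: positionY is the one number to tune.
function photoStyle(positionY: number): { objectFit: string; objectPosition: string } {
  return { objectFit: "cover", objectPosition: "50% " + positionY + "%" };
}

const zone = { width: Math.round(1280 * 0.6), height: 720 };           // ~60% of a 1280x720 thumbnail
const overflow = coverOverflowY({ width: 2160, height: 3840 }, zone);  // assumed source size
console.log(Math.round(overflow));                  // 645
console.log(Math.round(hiddenTopPx(overflow, 25))); // 161
console.log(photoStyle(25));
```

With roughly 645 px of vertical overflow to distribute, a small change in that percentage decides whether the crop lands above the head or through the chin.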

I picked A. More stopping power for the hook.

The Conversation-First Editing Loop

The part I want to highlight most: all of this happens through conversation. I'm not writing React code from scratch. I describe what I want — "make the sidebars kind of trippy without distracting from the video", "add an accent line at the moment I say 'freedom number'", "try the A/B with my chin fully in the shot" — and the system figures out the implementation.

The Studio preview makes everything immediately verifiable. If it looks wrong, I say what's wrong and it gets fixed. If it looks right, we move on.

Each project also gets faster because the patterns carry over. The transcript-to-overlay system I built for this video is reusable on every video from here on. That's the kind of compounding I'm looking for.

What's Next

This video is the first in a series. I'll be tracking progress toward the $5K/month goal publicly — what's working, what's not, actual numbers over time.

Want a say in where I go deeper next? Vote on the next pillar video. And if you've tried AI-assisted video editing — or you're curious where to start — I'd love to hear where you're at.

Built with: Remotion 4.0 · ffmpeg 8.1 · OpenAI Whisper (medium.en) · Claude Opus
