Voiceover script writing guide

Updated July 2026 · 6 min read

By Zohaib Akeel · Cosette Team

Image: a writer drafting a voiceover script in a notebook at a desk. Caption: Voiceover scripts are written for the ear, with short sentences and clear hooks.

Voiceover scripts are not blog posts read aloud. They are timed instructions for the ear — where every sentence has a job and every pause is deliberate. Weak scripts make even premium TTS sound hollow; tight scripts make modest voices feel confident on YouTube, in courses, and on IVR prompts.

This guide teaches hooks, rhythm, readability, and revision passes that shrink word count while increasing clarity. Write in Google Docs, then paste sections into Cosette to hear problems your eyes skip.

Hook formulas that work on audio

Open with stakes: "If you upload Hindi videos without fixing this, retention drops in ten seconds." Avoid slow intros — context comes after the promise.

  • Question hook for explainers
  • Contrarian hook for finance and tech
  • Story hook for documentaries — one concrete detail

Sentence length and cadence

Average 8–14 words per sentence for narration. Alternate short punchy lines with one longer explanatory sentence. Monotonous length creates robotic delivery even on neural voices.
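The 8–14 word target can be checked mechanically before a read-aloud pass. The following is a minimal sketch, not a definitive tool: the function names and the naive sentence splitter are assumptions for illustration, and abbreviations such as "e.g." will confuse the split.

```python
import re


def sentence_lengths(script: str) -> list[tuple[int, str]]:
    """Return (word_count, sentence) pairs for a script.

    Splits naively after ., ! or ? followed by whitespace (an assumption;
    abbreviations and decimals will be split incorrectly).
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    return [(len(s.split()), s) for s in sentences]


def cadence_report(script: str, low: int = 8, high: int = 14) -> dict:
    """Summarise sentence lengths against the guide's 8-14 word average."""
    counts = [n for n, _ in sentence_lengths(script)]
    average = sum(counts) / len(counts) if counts else 0.0
    return {
        "counts": counts,                      # per-sentence word counts, in order
        "average": round(average, 1),          # mean words per sentence
        "in_range": low <= average <= high,    # does the average hit the target?
    }


sample = (
    "Stop reading your blog post aloud. "
    "A script for the ear needs short lines, clear signposts, and one idea per sentence. "
    "Then it breathes."
)
report = cadence_report(sample)
```

Looking at `report["counts"]` as well as the average matters: a run of near-identical counts points to the monotonous length the paragraph above warns about, even when the average is in range.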

Transitions listeners can follow

Use signposts: "First," "The real problem is," "Here is the fix." Visual cuts may disappear on audio-only platforms — transitions must be verbal.

Numbers, dates, and acronyms

Write "twenty twenty-six" if that is how you want it spoken. Expand acronyms on first use unless the audience is expert. For Hindi and Urdu, decide once whether brand names appear in Latin or native script, and record that choice in a style guide.

Revision passes

  1. Cut 15% of the words without losing facts
  2. Read aloud and mark breath breaks
  3. Generate audio and note every stumble
  4. Fix pronunciation spellings
  5. Lock version v1.0 before the video edit begins

Scripts for different lengths

60-second ad ≈ 150 words. Five-minute YouTube video ≈ 750 words. Ten-minute deep dive ≈ 1,400 words at conversational pace. That works out to roughly 140–150 words per minute; adjust for the speed setting in your TTS engine.
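These word counts follow from simple rate arithmetic, which can be sketched as below. The function names are hypothetical, and treating the TTS speed setting as a plain multiplier on words per minute is an assumption; real engines may not scale linearly, so check the result against generated audio.

```python
def estimate_seconds(word_count: int, wpm: float = 150.0, speed: float = 1.0) -> float:
    """Estimate narration length in seconds.

    wpm=150 is the pace implied by "60-second ad = 150 words".
    speed is assumed to be a simple multiplier on that pace.
    """
    if wpm <= 0 or speed <= 0:
        raise ValueError("wpm and speed must be positive")
    return word_count / (wpm * speed) * 60.0


def word_budget(seconds: float, wpm: float = 150.0, speed: float = 1.0) -> int:
    """Inverse of estimate_seconds: how many words fit in a time slot."""
    if wpm <= 0 or speed <= 0:
        raise ValueError("wpm and speed must be positive")
    return round(seconds / 60.0 * wpm * speed)


ad_seconds = estimate_seconds(150)                 # 60-second ad
youtube_seconds = estimate_seconds(750)            # five-minute video
deep_dive_words = word_budget(600, wpm=140.0)      # ten minutes at a slower pace
```

The ten-minute figure of 1,400 words corresponds to 140 words per minute rather than 150, which is why `wpm` is a parameter instead of a constant.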

Collaboration and versioning

Name files consistently, for example script_ep12_v3.txt, and note the voice ID and speed in header comments. Editors need sync markers for retakes.

From script to publish

Split the script into sections that match the sections of the edit, and generate audio one H2 block at a time. Assemble the clips in the timeline so you can swap one block without redoing the whole video.

See natural AI voice tips for delivery polish.

Voice selection in production

Cast a voice like hiring an actor: record three candidates on the same paragraph, blind-test them with five listeners, and pick the winner by score, not gut. Document the choice in the style guide, along with forbidden alternatives, to prevent drift.

Seasonal refreshes (holiday ads, exam pushes) can keep the same voice — consistency builds brand equity. Swap voice only for deliberate spin-offs labeled as such.

When A/B testing hooks, change the script, not the voice; otherwise you confound variables.

Key takeaways for VO scripts

Write in spoken grammar, cut fifteen percent on the revision pass, and time scripts with a stopwatch. Generate audio only after the read-aloud pass. See our natural AI voice tips and pronunciation guide for polish steps.

Timing scripts by format

60 s ad ≈ 150 words. 5 min YouTube ≈ 750 words. 10 min deep dive ≈ 1,400 words at conversational TTS speed. Always time your hook by reading it against a timer.

Collaboration with editors

Mark script sections S1, S2, and so on for timeline sync. Note pronunciation spellings inline for TTS, for example [pron: koh-ZET]. Put the date in versioned file names.

Revision passes that shrink word count

First draft captures facts; second pass cuts fifteen percent by removing filler (“basically,” “in order to”). Third pass reads aloud — if you need two breaths, split the sentence. Fourth pass marks pronunciation spellings for TTS: brand names, acronyms, and numbers in spoken form.
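The second pass is easier to run with a concrete word target and a list of filler hits. The sketch below assumes a hypothetical filler list seeded with the two examples above; extend it with your own house list. It only counts matches and never rewrites the script.

```python
import re

# Filler phrases from the guide's examples; extend this list for your own style.
FILLERS = ("basically", "in order to")


def cut_target(word_count: int, percent: int = 15) -> int:
    """Word count to aim for after cutting `percent` of the draft (rounded down)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return word_count * (100 - percent) // 100


def count_fillers(script: str) -> dict[str, int]:
    """Count whole-phrase, case-insensitive occurrences of each filler."""
    lowered = script.lower()
    return {
        phrase: len(re.findall(rf"\b{re.escape(phrase)}\b", lowered))
        for phrase in FILLERS
    }


draft = (
    "Basically, you paste the script in order to hear it. "
    "In order to fix it, basically listen."
)
hits = count_fillers(draft)
target = cut_target(1000)
```

Integer arithmetic keeps the target predictable; a 1,000-word draft gives a target of 850 words.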

Time hooks with a stopwatch: YouTube intros should deliver the promise within eight seconds. Ads need the offer before the fifteen-second mark on most platforms.

Collaborating with editors and clients

Label sections S1, S2 in the script so timeline comments map cleanly. Inline notes like [pause] or [emphasis] help you; avoid stage directions the engine cannot perform. Version filenames with dates — script_2026-07-04_v3.txt — so audio exports stay traceable.
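A dated filename convention is only traceable if everyone builds it the same way. Here is a minimal sketch of a builder and parser for the script_2026-07-04_v3.txt pattern; the function names and the strict regular expression are assumptions, not part of any Cosette tooling.

```python
import re
from datetime import date

# Matches names like script_2026-07-04_v3.txt and nothing looser.
FILENAME_PATTERN = re.compile(r"^script_(\d{4}-\d{2}-\d{2})_v(\d+)\.txt$")


def script_filename(day: date, version: int) -> str:
    """Build a dated, versioned script filename."""
    if version < 1:
        raise ValueError("version must be 1 or greater")
    return f"script_{day.isoformat()}_v{version}.txt"


def parse_script_filename(name: str) -> tuple[date, int]:
    """Recover (date, version) from a filename, or raise ValueError."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a dated script filename: {name!r}")
    return date.fromisoformat(match.group(1)), int(match.group(2))


name = script_filename(date(2026, 7, 4), 3)
parsed_day, parsed_version = parse_script_filename(name)
```

ISO dates sort correctly as plain text, so a folder listing shows exports in chronological order without extra tooling.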

Cold-open patterns that work on TTS

Start with the payoff or conflict, then backfill context — “Three banks failed the same week because of one spreadsheet error” beats “Today we will discuss banking.” TTS delivers declarative cold opens cleanly when sentences stay under eighteen words.

End sections with a bridge sentence teasing the next topic — retention tools on YouTube reward seamless chapter flow.

Numbers read aloud

Write “twenty twenty-six” versus “two thousand twenty-six” based on house style — TTS will not choose for you. Percentages in finance scripts need spoken form: “five percent” not “5%” if the engine misreads symbols. Phone numbers benefit from grouping pauses: “nine eight seven, six five four, three two one zero.”
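The phone-number grouping above can be generated rather than typed by hand. This sketch assumes a ten-digit number split 3-3-4, as in the example; the function name and the default grouping are illustrative, and whether a comma produces an audible pause depends on the TTS engine.

```python
DIGIT_WORDS = (
    "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine",
)


def speak_phone(number: str, groups: tuple[int, ...] = (3, 3, 4)) -> str:
    """Spell a phone number as digit words, with commas between groups.

    Non-digit characters (spaces, dashes, brackets) are ignored.
    """
    digits = [c for c in number if c in "0123456789"]
    if len(digits) != sum(groups):
        raise ValueError(f"expected {sum(groups)} digits, got {len(digits)}")
    parts = []
    start = 0
    for size in groups:
        chunk = digits[start:start + size]
        parts.append(" ".join(DIGIT_WORDS[int(d)] for d in chunk))
        start += size
    return ", ".join(parts)


spoken = speak_phone("987-654-3210")
```

Paste the resulting string into the script in place of the numeric form so the engine reads each digit and pauses at the commas.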

Read time versus wall time

TTS read time often runs shorter than human read time for the same word count, so calibrate with a stopwatch on the generated audio rather than relying only on WPM tables. Pad hooks by two seconds for vertical formats with fast cuts.

Legal hold wording

Regulated scripts need exact legal sentences — never paraphrase compliance lines for “flow.” TTS will read approved text faithfully if you paste faithfully.

Table and list narration

Tables read poorly in TTS — rewrite as spoken lists with parallel structure. “Row one: revenue up five percent” beats reading grid coordinates aloud.

Closing production checklist

Before you generate audio:

  • Read the script aloud once and time the hook.
  • Mark pronunciation spellings inline.
  • Cut fifteen percent of filler on the second pass.
  • Label sections for editors.
  • Write numbers in spoken form, and paste legal lines exactly as approved.
  • Rewrite nested clauses; TTS rewards oral grammar.
  • Put the date in the version filename.
  • Fix the text before swapping voices; script quality sets the ceiling.
  • Deliver the script PDF with the audio for client archives.

One habit to keep

Document voice ID, script version, and export date in every project folder before upload. Future you — and any freelancer — ship faster when settings are not guesswork. That habit prevents most inconsistent TTS output across a series.

Frequently asked questions

How long should a YouTube VO script be?

Budget about 130–150 words per minute at normal TTS speed, so a five-minute video needs roughly 650–750 words.

Should I write in spoken or written grammar?

Spoken — contractions and direct address where appropriate.

How do I test before full generate?

Read aloud once; generate only the hook; fix before body.

Can I use AI to draft scripts?

Yes, but rewrite for the ear and verify facts; AdSense rewards original value.

Urdu script tips?

Shorter clauses; preview Nastaliq punctuation; test English loanwords separately.
