Voice activity detection
Voice activity detection (VAD) is the technique of detecting when human speech is present in an audio stream, separating talking from silence and background noise so that downstream processing only runs on the parts that matter.
Last updated May 24, 2026
How it works
VAD continuously inspects the incoming audio and flags the segments that contain speech. Those segments are passed on to streaming transcription and speaker diarization, while silence and noise are skipped.
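As a rough illustration of the flagging step, here is a minimal sketch that assumes the simplest possible approach: an energy threshold applied to short frames. This is not how Canary or any particular product implements VAD, and production systems often use trained models instead. The function names, the 20 ms frame size, the threshold, and the hangover count are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    samples: Sequence[float],
    sample_rate: int = 16000,
    frame_ms: int = 20,          # assumed frame size
    threshold: float = 0.02,     # assumed energy threshold, needs tuning
    hangover_frames: int = 5,    # quiet frames tolerated inside a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of segments flagged as active."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    segments: List[Tuple[int, int]] = []
    start = None   # sample index where the current segment began
    quiet = 0      # consecutive quiet frames seen inside the segment

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i * frame_len
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover_frames:
                # Close the segment at the end of the last active frame.
                segments.append((start, (i - quiet + 1) * frame_len))
                start = None
                quiet = 0

    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, (n_frames - quiet) * frame_len))
    return segments
```

Only the returned segments would be handed to transcription and diarization; everything outside them is skipped. An energy threshold cannot tell speech from other loud sounds, which is one reason real systems tend to rely on spectral features or learned classifiers.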
Why it matters
In a live setting, VAD keeps latency and cost down, since there is no point transcribing dead air, and it improves accuracy by giving the transcriber cleaner input. For a real-time tool like Canary, fast and reliable VAD on the captured system audio is part of what keeps the live rolling summary responsive enough to be useful the moment your name is called.