---
layout: default
title: "Chapter 2: Audio Processing Fundamentals"
parent: Whisper.cpp Tutorial
nav_order: 2
---
Welcome back! Now that you have Whisper.cpp up and running, let's dive into the fascinating world of audio processing. Understanding how audio works is crucial for getting the best results from speech recognition systems. In this chapter, we'll explore the fundamentals of digital audio and how Whisper.cpp processes sound.
Imagine trying to read a book where all the letters are jumbled together - that's what raw audio looks like to a computer! Audio processing transforms continuous sound waves into a format that machine learning models can understand and work with.
Analog Audio:
- Continuous sound waves
- Like a vinyl record groove
- Infinite resolution
Digital Audio:
- Discrete samples of the sound wave
- Like dots on a connect-the-dots picture
- Finite resolution: the sampling rate limits time resolution, the bit depth limits amplitude resolution
```python
# Understanding audio properties
sample_rate = 16000  # Samples per second (Hz)
bit_depth = 16       # Bits per sample
channels = 1         # Mono = 1, Stereo = 2
duration = 10        # Seconds

# Calculate raw (uncompressed) file size
samples = sample_rate * duration * channels
bytes_per_sample = bit_depth // 8
file_size = samples * bytes_per_sample

print(f"Audio file will have {samples} samples")
print(f"Estimated file size: {file_size} bytes")
```

The Nyquist-Shannon sampling theorem states that to reconstruct a signal accurately, you must sample at a rate at least twice the highest frequency present in the signal.
For human speech (where nearly all useful content sits below ~8kHz), we need:
- Minimum sampling rate: 16kHz (2 × 8kHz)
- Whisper's sampling rate: 16kHz (perfect for speech)
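To see what goes wrong below the Nyquist rate, here is a small NumPy sketch (the tone frequencies are arbitrary choices for illustration). A 10kHz tone sampled at 16kHz produces exactly the same samples, up to sign, as a 6kHz tone, so after sampling the two are indistinguishable:

```python
import numpy as np

sr = 16000               # sampling rate (Hz)
t = np.arange(sr) / sr   # one second of sample times

# 10 kHz is above the 8 kHz Nyquist limit for 16 kHz sampling...
tone_10k = np.sin(2 * np.pi * 10000 * t)
# ...so it folds back to 16000 - 10000 = 6000 Hz
alias_6k = np.sin(2 * np.pi * 6000 * t)

# The sampled waveforms match sample-for-sample (up to sign)
print(np.allclose(tone_10k, -alias_6k, atol=1e-9))  # True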
Analog wave:      ~~~~~/\/\/\/\/\~~~~~   (continuous signal)

Digital samples:  •  •  •  •  •  •  •    (discrete values)
                  ↑  ↑  ↑  ↑  ↑  ↑  ↑
                  16 kHz sampling points
| Format | Compression | Quality | Whisper Support |
|---|---|---|---|
| WAV | Uncompressed | Highest | ✅ Native |
| FLAC | Lossless | High | ✅ Via conversion |
| MP3 | Lossy | Good | ✅ Via ffmpeg |
| M4A/AAC | Lossy | Very Good | ✅ Via ffmpeg |
| OGG | Lossy | Good | ✅ Via ffmpeg |
```bash
# Convert MP3 to WAV (Whisper's preferred format)
ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 output.wav
# Convert video to audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
# Convert with explicit parameters
# (inline comments after a trailing "\" break the line continuation,
#  so the options are annotated here instead)
#   -acodec pcm_s16le : 16-bit PCM
#   -ar 16000         : 16 kHz sample rate
#   -ac 1             : mono channel
#   -f wav            : WAV container
ffmpeg -i input.m4a -acodec pcm_s16le -ar 16000 -ac 1 -f wav output.wav
```

```cpp
// Load the audio into a float buffer first (whisper.cpp's examples
// use dr_wav for this), then compute the Mel spectrogram
std::vector<float> audio_data; // 16 kHz mono samples in [-1, 1]

// whisper_pcm_to_mel() takes the sample buffer, a sample count
// and a thread count, not a file path
if (whisper_pcm_to_mel(ctx, audio_data.data(), (int) audio_data.size(), 4) != 0) {
    std::cerr << "Failed to compute Mel spectrogram" << std::endl;
    return 1;
}
```

Under the hood, the audio loading stage performs:
- File parsing: Reads audio file headers
- Sample extraction: Converts to 16kHz float array
- Normalization: Scales samples to [-1, 1] range
- Mono conversion: Mixes stereo to mono if needed
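As a rough illustration of those four steps, here is a minimal pure-Python sketch using the standard-library `wave` module and NumPy. It assumes 16-bit PCM input and skips resampling, which a real loader also performs when the file is not already at 16kHz:

```python
import wave
import numpy as np

def load_wav_mono_float(path):
    """Read a 16-bit PCM WAV file into a float32 array in [-1, 1]."""
    with wave.open(path, "rb") as wf:            # file parsing: read headers
        n_channels = wf.getnchannels()
        assert wf.getsampwidth() == 2, "sketch assumes 16-bit PCM"
        raw = wf.readframes(wf.getnframes())     # sample extraction

    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    samples /= 32768.0                           # normalization to [-1, 1]

    if n_channels == 2:                          # mono conversion
        samples = samples.reshape(-1, 2).mean(axis=1)
    return samples

audio = load_wav_mono_float("recording.wav")
```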
Whisper uses a Mel spectrogram to convert audio into a format the neural network can understand:
```python
import librosa
import numpy as np

def extract_features(audio_path):
    # Load audio at 16kHz
    audio, sr = librosa.load(audio_path, sr=16000)

    # Extract Mel spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_fft=400,        # Window size (25ms at 16kHz)
        hop_length=160,   # Hop size (10ms at 16kHz)
        n_mels=80         # Number of Mel bands
    )

    # Convert to log scale
    log_mel = librosa.power_to_db(mel_spec)
    return log_mel

# Shape: (80, time_steps)
features = extract_features("audio.wav")
print(f"Feature shape: {features.shape}")
```

The Mel spectrogram is fed into the Whisper model:
Raw Audio (16kHz) → Mel Spectrogram (80x3000) → Whisper Encoder → Text Tokens → Final Text
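That 80×3000 shape is a direct consequence of the parameters above: Whisper pads or trims audio to a fixed 30-second window, and a 160-sample hop at 16kHz yields 100 frames per second. A quick arithmetic check:

```python
sample_rate = 16000    # Hz
window_sec = 30        # Whisper pads/trims input to 30 s
hop_length = 160       # samples between frames (10 ms)
n_mels = 80            # Mel bands

time_steps = (window_sec * sample_rate) // hop_length
print(f"Mel spectrogram shape: ({n_mels}, {time_steps})")  # (80, 3000)
```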
The Mel scale mimics human hearing perception:
- Linear frequency: 0Hz, 100Hz, 200Hz, 300Hz...
- Mel scale: equal mel steps sound equally spaced to a listener, so each step spans a wider frequency range as pitch rises
- Formula: mel = 2595 × log₁₀(1 + f/700)
```python
import numpy as np

def frequency_to_mel(frequency):
    """Convert frequency (Hz) to the Mel scale"""
    return 2595 * np.log10(1 + frequency / 700)

def mel_to_frequency(mel):
    """Convert Mel scale to frequency (Hz)"""
    return 700 * (10**(mel / 2595) - 1)

# Examples
print(f"1000Hz = {frequency_to_mel(1000):.0f} mel")
print(f"4000Hz = {frequency_to_mel(4000):.0f} mel")
print(f"100 mel = {mel_to_frequency(100):.0f} Hz")
```

The Mel representation has several advantages:
- Perceptual relevance: Matches human hearing
- Dimensionality reduction: Fewer features than raw spectrograms
- Robustness: Less sensitive to noise and variations
- Computational efficiency: Smaller matrices to process
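The dimensionality-reduction point is easy to quantify: with Whisper's 400-point FFT, each frame of a raw spectrogram has 201 linear-frequency bins, versus 80 Mel bands:

```python
n_fft = 400
stft_bins = n_fft // 2 + 1   # 201 linear-frequency bins per frame
n_mels = 80                  # Mel bands per frame

print(f"{stft_bins} STFT bins -> {n_mels} Mel bands "
      f"({stft_bins / n_mels:.1f}x fewer features)")
```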
```python
import librosa
import numpy as np

def analyze_audio_quality(audio_path):
    """Analyze an audio file for potential issues"""
    audio, sr = librosa.load(audio_path, sr=None)
    issues = []

    # Check sample rate
    if sr != 16000:
        issues.append(f"Sample rate {sr}Hz (should be 16000Hz)")

    # Check for silence
    if np.max(np.abs(audio)) < 0.01:
        issues.append("Audio appears to be silent")

    # Check for clipping
    if np.max(np.abs(audio)) > 0.95:
        issues.append("Audio may be clipped")

    # Check duration
    duration = len(audio) / sr
    if duration < 0.5:
        issues.append("Audio too short (< 0.5s)")
    elif duration > 30:
        issues.append("Audio longer than one 30s Whisper window")

    return issues

issues = analyze_audio_quality("recording.wav")
for issue in issues:
    print(f"⚠️ {issue}")
```
```python
import librosa
import numpy as np
import soundfile as sf

def preprocess_audio(audio_path):
    """Apply common audio preprocessing techniques"""
    # Load audio
    audio, sr = librosa.load(audio_path, sr=16000)

    # 1. Normalize volume
    audio = librosa.util.normalize(audio)

    # 2. Remove DC offset
    audio = audio - np.mean(audio)

    # 3. Pre-emphasis (a light first-order high-pass filter)
    audio = librosa.effects.preemphasis(audio, coef=0.97)

    # 4. Trim leading/trailing silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    return audio, sr

# Preprocess and save
# (librosa.output.write_wav was removed in librosa 0.8; use soundfile)
processed_audio, sr = preprocess_audio("input.wav")
sf.write("processed.wav", processed_audio, sr)
```

```python
# Python recording with sounddevice
import sounddevice as sd
import soundfile as sf

def record_audio(duration=5, sample_rate=16000):
    """Record audio from the microphone"""
    print("🎤 Recording...")

    # Record audio
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate,
                   channels=1,
                   dtype='float32')
    sd.wait()  # Wait for recording to finish

    print("✅ Recording complete")
    return audio.flatten()

# Record and save
audio = record_audio(duration=10)
sf.write("recording.wav", audio, 16000)
```

```cpp
// C++ streaming audio processing
#include <whisper.h>
#include <portaudio.h>

// AudioBuffer and process_audio_chunk() are user-defined helpers
// (sketched here; not part of whisper.cpp or PortAudio)

// Callback invoked by PortAudio for each block of captured audio
static int audio_callback(const void *inputBuffer, void *outputBuffer,
                          unsigned long framesPerBuffer,
                          const PaStreamCallbackTimeInfo* timeInfo,
                          PaStreamCallbackFlags statusFlags,
                          void *userData) {
    const float *audio_data = (const float*)inputBuffer;
    AudioBuffer *buffer = (AudioBuffer*)userData;

    // Add new audio to buffer
    buffer->add_samples(audio_data, framesPerBuffer);

    // Process in chunks when buffer is full
    if (buffer->is_full()) {
        // Run Whisper inference on the accumulated chunk
        process_audio_chunk(buffer->get_data());
        buffer->clear();
    }

    return paContinue;
}
```

```python
import pyaudio
import numpy as np
import threading
import queue

class RealtimeTranscriber:
    def __init__(self):
        self.audio_queue = queue.Queue()
        self.is_running = False

    def audio_callback(self, in_data, frame_count, time_info, status):
        """Queue audio chunks as they arrive"""
        audio_data = np.frombuffer(in_data, dtype=np.float32)

        # Add to processing queue
        self.audio_queue.put(audio_data)
        return (in_data, pyaudio.paContinue)

    def transcription_worker(self):
        """Process queued audio chunks and transcribe"""
        while self.is_running:
            try:
                # Get audio chunk with timeout
                audio_chunk = self.audio_queue.get(timeout=1.0)

                # Process with Whisper
                result = self.transcribe_chunk(audio_chunk)
                if result:
                    print(f"🎤 {result}")
            except queue.Empty:
                continue

    def start(self):
        """Start real-time transcription"""
        self.is_running = True

        # Start worker thread
        worker_thread = threading.Thread(target=self.transcription_worker)
        worker_thread.daemon = True
        worker_thread.start()

        # Start audio stream
        # ... audio setup code ...

    def stop(self):
        """Stop transcription"""
        self.is_running = False
```
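The class above leaves the audio stream setup as a stub. A minimal sketch of that wiring with PyAudio follows; the 1024-frame buffer size is an arbitrary choice, and `transcribe_chunk` must still be supplied (e.g. via a whisper.cpp binding):

```python
import pyaudio

transcriber = RealtimeTranscriber()
transcriber.start()

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paFloat32,   # matches np.float32 in the callback
                 channels=1,                 # mono, as Whisper expects
                 rate=16000,                 # 16 kHz sample rate
                 input=True,
                 frames_per_buffer=1024,     # arbitrary chunk size
                 stream_callback=transcriber.audio_callback)
stream.start_stream()
# ... later: stream.stop_stream(); stream.close(); transcriber.stop()
```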
```python
import matplotlib.pyplot as plt
import librosa
import librosa.display
import numpy as np

def visualize_audio(audio_path):
    """Create visualizations of audio data"""
    audio, sr = librosa.load(audio_path, sr=16000)

    # Create figure with subplots
    fig, axes = plt.subplots(3, 1, figsize=(12, 8))

    # 1. Waveform
    axes[0].plot(audio)
    axes[0].set_title('Waveform')
    axes[0].set_xlabel('Samples')
    axes[0].set_ylabel('Amplitude')

    # 2. Spectrogram
    D = librosa.stft(audio)
    S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)
    librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='hz',
                             ax=axes[1])
    axes[1].set_title('Spectrogram')

    # 3. Mel spectrogram
    mel_spec = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
    mel_db = librosa.power_to_db(mel_spec, ref=np.max)
    librosa.display.specshow(mel_db, sr=sr, x_axis='time', y_axis='mel',
                             ax=axes[2])
    axes[2].set_title('Mel Spectrogram')

    plt.tight_layout()
    plt.savefig('audio_analysis.png')
    plt.show()

visualize_audio("recording.wav")
```

```bash
# Check audio file properties
ffprobe recording.wav

# Convert to different formats for testing
ffmpeg -i recording.wav -acodec pcm_s16le -ar 16000 test.wav

# Test with different Whisper models
./main -m models/ggml-tiny.en.bin -f recording.wav
./main -m models/ggml-base.en.bin -f recording.wav
./main -m models/ggml-small.en.bin -f recording.wav

# Check system audio setup
arecord -l                        # Linux
system_profiler SPAudioDataType   # macOS
```

| Format | Load Time | Processing Speed | Quality |
|---|---|---|---|
| WAV (PCM) | Fastest | Fastest | Best |
| FLAC | Fast | Fast | Best |
| M4A/AAC | Medium | Medium | Very Good |
| MP3 | Slowest | Medium | Good |
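If you want to verify these relative load times on your own files, a quick timing sketch with librosa follows; the file names are placeholders, and decoding the compressed formats requires an ffmpeg-backed loader on your system:

```python
import time
import librosa

for path in ["sample.wav", "sample.flac", "sample.m4a", "sample.mp3"]:
    start = time.perf_counter()
    audio, sr = librosa.load(path, sr=16000)
    elapsed = time.perf_counter() - start
    print(f"{path}: {len(audio)/sr:.1f}s of audio decoded in {elapsed*1000:.0f} ms")
```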
```cpp
// Estimate memory usage for different models.
// NOTE: the whisper_* accessors below are illustrative; check the
// current whisper.h for the exact introspection API your version exposes.
size_t estimate_memory_usage(const char* model_path) {
    struct whisper_context *ctx = whisper_init_from_file(model_path);
    if (!ctx) return 0;

    size_t model_size = whisper_model_n_bytes(ctx);
    size_t context_size = whisper_n_context(ctx) * sizeof(float);
    size_t mel_size = whisper_n_mels(ctx) * whisper_n_len(ctx) * sizeof(float);

    whisper_free(ctx);
    return model_size + context_size + mel_size;
}

size_t memory_needed = estimate_memory_usage("models/ggml-base.en.bin");
printf("Estimated memory usage: %.1f MB\n", memory_needed / (1024.0 * 1024.0));
```

Fantastic progress! 🎉 You've now mastered:
- Digital Audio Fundamentals - Sampling rates, bit depths, and formats
- Audio Processing Pipeline - From raw audio to Mel spectrograms
- Whisper's Audio Requirements - 16kHz mono audio for optimal performance
- Audio Conversion Tools - Using ffmpeg for format conversion
- Real-time Audio Processing - Streaming and chunked processing techniques
- Audio Analysis - Visualizing and debugging audio data
- Performance Optimization - Format selection and memory management
Now that you understand how audio processing works, let's explore the neural network architecture that makes Whisper so powerful. In Chapter 3: Model Architecture & GGML, we'll dive into the technical details of how Whisper processes audio into text.
Try these exercises:
- Record audio at different sample rates and compare transcription quality
- Create a script to batch-convert a folder of audio files
- Build a simple audio visualizer that shows the Mel spectrogram
- Experiment with audio preprocessing techniques on noisy recordings
How does understanding audio processing change how you think about speech recognition? 🔊