Skip to content

Latest commit

 

History

History
562 lines (409 loc) · 16.9 KB

File metadata and controls

562 lines (409 loc) · 16.9 KB
layout title parent nav_order
default
Chapter 2: Audio Processing Fundamentals
Whisper.cpp Tutorial
2

Chapter 2: Audio Processing Fundamentals

Welcome to Chapter 2: Audio Processing Fundamentals. In this part of Whisper.cpp Tutorial: High-Performance Speech Recognition in C/C++, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

Welcome back! Now that you have Whisper.cpp up and running, let's dive into the fascinating world of audio processing. Understanding how audio works is crucial for getting the best results from speech recognition systems. In this chapter, we'll explore the fundamentals of digital audio and how Whisper.cpp processes sound.

What Makes Audio Processing Important?

Imagine trying to read a book where all the letters are jumbled together - that's what raw audio looks like to a computer! Audio processing transforms continuous sound waves into a format that machine learning models can understand and work with.

Digital Audio Basics

1. Analog vs Digital Audio

Analog Audio:

  • Continuous sound waves
  • Like a vinyl record groove
  • Infinite resolution

Digital Audio:

  • Discrete samples of the sound wave
  • Like dots on a connect-the-dots picture
  • Finite resolution determined by sampling rate

2. Key Audio Properties

# Understanding audio properties
sample_rate = 16000  # Samples per second (Hz)
bit_depth = 16       # Bits per sample
channels = 1         # Mono = 1, Stereo = 2
duration = 10        # Seconds

# Calculate file size
samples = sample_rate * duration * channels
bytes_per_sample = bit_depth / 8
file_size = samples * bytes_per_sample

print(f"Audio file will have {samples} samples")
print(f"Estimated file size: {file_size} bytes")

How Audio Sampling Works

The Sampling Theorem

The Nyquist-Shannon sampling theorem states: "To accurately reconstruct a signal, you must sample at least twice the highest frequency in the signal."

For human speech (highest frequency ~8kHz), we need:

  • Minimum sampling rate: 16kHz (2 × 8kHz)
  • Whisper's sampling rate: 16kHz (perfect for speech)

Visualizing Sampling

Analog Wave: ~~~~~~~~~~~~~~~~~~~~~~~~~
                   ^
                  / \
                 /   \
                /     \
               /       \
              /         \
             /           \
            /             \
           /               \
          /                 \
         /                   \
        /                     \
       /                       \
      /                         \
     /                           \
    /                             \
   /                               \
  v                                 v

Digital Samples: •••••••••••••••••••••••
               ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
            16kHz sampling points

Audio Formats and Codecs

Common Audio Formats

Format Compression Quality Whisper Support
WAV Uncompressed Highest ✅ Native
FLAC Lossless High ✅ Via conversion
MP3 Lossy Good ✅ Via ffmpeg
M4A/AAC Lossy Very Good ✅ Via ffmpeg
OGG Lossy Good ✅ Via ffmpeg

Converting Audio for Whisper

# Convert MP3 to WAV (Whisper's preferred format)
ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 output.wav

# Convert video to audio
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

# Convert with specific parameters
ffmpeg -i input.m4a \
       -acodec pcm_s16le \  # 16-bit PCM
       -ar 16000 \          # 16kHz sample rate
       -ac 1 \              # Mono channel
       -f wav \             # WAV format
       output.wav

How Whisper Processes Audio

Step 1: Audio Loading

// Load audio file
std::vector<float> audio_data;
if (whisper_pcm_to_mel(ctx, audio_path, 0, whisper_n_samples(ctx)) != 0) {
    std::cerr << "Failed to load audio" << std::endl;
    return 1;
}

Under the hood:

  1. File parsing: Reads audio file headers
  2. Sample extraction: Converts to 16kHz float array
  3. Normalization: Scales samples to [-1, 1] range
  4. Mono conversion: Mixes stereo to mono if needed

Step 2: Feature Extraction

Whisper uses a Mel spectrogram to convert audio into a format the neural network can understand:

import librosa
import numpy as np

def extract_features(audio_path):
    # Load audio at 16kHz
    audio, sr = librosa.load(audio_path, sr=16000)

    # Extract Mel spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_fft=400,      # Window size
        hop_length=160, # Hop size (10ms at 16kHz)
        n_mels=80       # Number of Mel bands
    )

    # Convert to log scale
    log_mel = librosa.power_to_db(mel_spec)

    return log_mel

# Shape: (80, time_steps)
features = extract_features("audio.wav")
print(f"Feature shape: {features.shape}")

Step 3: Model Processing

The Mel spectrogram is fed into the Whisper model:

Raw Audio (16kHz) → Mel Spectrogram (80x3000) → Whisper Encoder → Text Tokens → Final Text

Understanding Mel Spectrograms

What is a Mel Scale?

The Mel scale mimics human hearing perception:

  • Linear frequency: 0Hz, 100Hz, 200Hz, 300Hz...
  • Mel scale: 0 mel, 100 mel, 200 mel, 300 mel...
  • Formula: mel = 2595 × log₁₀(1 + f/700)
import numpy as np

def frequency_to_mel(frequency):
    """Convert frequency to Mel scale"""
    return 2595 * np.log10(1 + frequency / 700)

def mel_to_frequency(mel):
    """Convert Mel scale to frequency"""
    return 700 * (10**(mel/2595) - 1)

# Examples
print(f"1000Hz = {frequency_to_mel(1000):.0f} mel")
print(f"4000Hz = {frequency_to_mel(4000):.0f} mel")
print(f"100 mel = {mel_to_frequency(100):.0f} Hz")

Why Mel Spectrograms Work Better

  1. Perceptual relevance: Matches human hearing
  2. Dimensionality reduction: Fewer features than raw spectrograms
  3. Robustness: Less sensitive to noise and variations
  4. Computational efficiency: Smaller matrices to process

Audio Quality and Preprocessing

Common Audio Issues

def analyze_audio_quality(audio_path):
    """Analyze audio file for potential issues"""

    import librosa

    audio, sr = librosa.load(audio_path, sr=None)

    issues = []

    # Check sample rate
    if sr != 16000:
        issues.append(f"Sample rate {sr}Hz (should be 16000Hz)")

    # Check for silence
    if np.max(np.abs(audio)) < 0.01:
        issues.append("Audio appears to be silent")

    # Check for clipping
    if np.max(np.abs(audio)) > 0.95:
        issues.append("Audio may be clipped")

    # Check duration
    duration = len(audio) / sr
    if duration < 0.5:
        issues.append("Audio too short (< 0.5s)")
    elif duration > 30:
        issues.append("Audio too long (> 30s)")

    return issues

issues = analyze_audio_quality("recording.wav")
for issue in issues:
    print(f"⚠️  {issue}")

Audio Preprocessing Techniques

import librosa
import numpy as np

def preprocess_audio(audio_path):
    """Apply common audio preprocessing techniques"""

    # Load audio
    audio, sr = librosa.load(audio_path, sr=16000)

    # 1. Normalize volume
    audio = librosa.util.normalize(audio)

    # 2. Remove DC offset
    audio = audio - np.mean(audio)

    # 3. Apply light noise reduction (simple high-pass filter)
    audio = librosa.effects.preemphasis(audio, coef=0.97)

    # 4. Trim silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    return audio, sr

# Preprocess and save
processed_audio, sr = preprocess_audio("input.wav")
librosa.output.write_wav("processed.wav", processed_audio, sr)

Working with Different Audio Sources

1. Microphone Recording

# Python recording with sounddevice
import sounddevice as sd
import soundfile as sf

def record_audio(duration=5, sample_rate=16000):
    """Record audio from microphone"""

    print("🎤 Recording...")

    # Record audio
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate,
                   channels=1,
                   dtype='float32')
    sd.wait()  # Wait for recording to finish

    print("✅ Recording complete")

    return audio.flatten()

# Record and save
audio = record_audio(duration=10)
sf.write("recording.wav", audio, 16000)

2. Streaming Audio

// C++ streaming audio processing
#include <whisper.h>
#include <portaudio.h>

// Callback for audio streaming
static int audio_callback(const void *inputBuffer, void *outputBuffer,
                         unsigned long framesPerBuffer,
                         const PaStreamCallbackTimeInfo* timeInfo,
                         PaStreamCallbackFlags statusFlags,
                         void *userData) {

    float *audio_data = (float*)inputBuffer;
    AudioBuffer *buffer = (AudioBuffer*)userData;

    // Add new audio to buffer
    buffer->add_samples(audio_data, framesPerBuffer);

    // Process in chunks when buffer is full
    if (buffer->is_full()) {
        // Run Whisper inference on chunk
        process_audio_chunk(buffer->get_data());
        buffer->clear();
    }

    return paContinue;
}

3. Real-time Processing

import pyaudio
import numpy as np
import threading
import queue

class RealtimeTranscriber:
    def __init__(self):
        self.audio_queue = queue.Queue()
        self.is_running = False

    def audio_callback(self, in_data, frame_count, time_info, status):
        """Process audio chunks as they arrive"""
        audio_data = np.frombuffer(in_data, dtype=np.float32)

        # Add to processing queue
        self.audio_queue.put(audio_data)

        return (in_data, pyaudio.paContinue)

    def transcription_worker(self):
        """Process audio chunks and transcribe"""
        while self.is_running:
            try:
                # Get audio chunk with timeout
                audio_chunk = self.audio_queue.get(timeout=1.0)

                # Process with Whisper
                result = self.transcribe_chunk(audio_chunk)

                if result:
                    print(f"🎤 {result}")

            except queue.Empty:
                continue

    def start(self):
        """Start real-time transcription"""
        self.is_running = True

        # Start worker thread
        worker_thread = threading.Thread(target=self.transcription_worker)
        worker_thread.daemon = True
        worker_thread.start()

        # Start audio stream
        # ... audio setup code ...

    def stop(self):
        """Stop transcription"""
        self.is_running = False

Audio Analysis and Debugging

Visualizing Audio

import matplotlib.pyplot as plt
import librosa.display

def visualize_audio(audio_path):
    """Create visualizations of audio data"""

    audio, sr = librosa.load(audio_path, sr=16000)

    # Create figure with subplots
    fig, axes = plt.subplots(3, 1, figsize=(12, 8))

    # 1. Waveform
    axes[0].plot(audio)
    axes[0].set_title('Waveform')
    axes[0].set_xlabel('Samples')
    axes[0].set_ylabel('Amplitude')

    # 2. Spectrogram
    D = librosa.stft(audio)
    S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)
    librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='hz',
                           ax=axes[1])
    axes[1].set_title('Spectrogram')

    # 3. Mel spectrogram
    mel_spec = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
    mel_db = librosa.power_to_db(mel_spec, ref=np.max)
    librosa.display.specshow(mel_db, sr=sr, x_axis='time', y_axis='mel',
                           ax=axes[2])
    axes[2].set_title('Mel Spectrogram')

    plt.tight_layout()
    plt.savefig('audio_analysis.png')
    plt.show()

visualize_audio("recording.wav")

Debugging Audio Issues

# Check audio file properties
ffprobe recording.wav

# Convert to different formats for testing
ffmpeg -i recording.wav -acodec pcm_s16le -ar 16000 test.wav

# Test with different Whisper models
./main -m models/ggml-tiny.en.bin -f recording.wav
./main -m models/ggml-base.en.bin -f recording.wav
./main -m models/ggml-small.en.bin -f recording.wav

# Check system audio setup
arecord -l  # Linux
system_profiler SPAudioDataType  # macOS

Performance Considerations

Audio Format Impact

Format Load Time Processing Speed Quality
WAV (PCM) Fastest Fastest Best
FLAC Fast Fast Best
M4A/AAC Medium Medium Very Good
MP3 Slowest Medium Good

Memory Usage

// Estimate memory usage for different models
size_t estimate_memory_usage(const char* model_path) {
    struct whisper_context *ctx = whisper_init_from_file(model_path);
    if (!ctx) return 0;

    size_t model_size = whisper_model_n_bytes(ctx);
    size_t context_size = whisper_n_context(ctx) * sizeof(float);
    size_t mel_size = whisper_n_mels(ctx) * whisper_n_len(ctx) * sizeof(float);

    whisper_free(ctx);

    return model_size + context_size + mel_size;
}

size_t memory_needed = estimate_memory_usage("models/ggml-base.en.bin");
printf("Estimated memory usage: %.1f MB\n", memory_needed / (1024.0 * 1024.0));

What We've Learned

Fantastic progress! 🎉 You've now mastered:

  1. Digital Audio Fundamentals - Sampling rates, bit depths, and formats
  2. Audio Processing Pipeline - From raw audio to Mel spectrograms
  3. Whisper's Audio Requirements - 16kHz mono audio for optimal performance
  4. Audio Conversion Tools - Using ffmpeg for format conversion
  5. Real-time Audio Processing - Streaming and chunked processing techniques
  6. Audio Analysis - Visualizing and debugging audio data
  7. Performance Optimization - Format selection and memory management

Next Steps

Now that you understand how audio processing works, let's explore the neural network architecture that makes Whisper so powerful. In Chapter 3: Model Architecture & GGML, we'll dive into the technical details of how Whisper processes audio into text.


Try these exercises:

  1. Record audio at different sample rates and compare transcription quality
  2. Create a script to batch-convert a folder of audio files
  3. Build a simple audio visualizer that shows the Mel spectrogram
  4. Experiment with audio preprocessing techniques on noisy recordings

How does understanding audio processing change how you think about speech recognition? 🔊

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for audio, librosa, self so behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

  • coupling core logic too tightly to one implementation path
  • missing the handoff boundaries between setup, execution, and validation
  • shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 2: Audio Processing Fundamentals as an operating subsystem inside Whisper.cpp Tutorial: High-Performance Speech Recognition in C/C++, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes around print, issues, recording as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 2: Audio Processing Fundamentals usually follows a repeatable control path:

  1. Context bootstrap: initialize runtime config and prerequisites for audio.
  2. Input normalization: shape incoming data so librosa receives stable contracts.
  3. Core execution: run the main logic branch and propagate intermediate state through self.
  4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
  5. Output composition: return canonical result payloads for downstream consumers.
  6. Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

  • View Repo Why it matters: authoritative reference on View Repo (github.com).

Suggested trace strategy:

  • search upstream code for audio and librosa to map concrete implementation paths
  • compare docs claims against actual runtime/config code before reusing patterns in production

Chapter Connections