Download complete websites from the Wayback Machine for offline viewing.
Wayback-Archive is a Python tool that downloads archived websites from the Wayback Machine and reconstructs them for fully functional offline viewing. It preserves all assets -- HTML, CSS, JavaScript, images, and fonts -- rewrites URLs to relative paths, and cleans up Wayback Machine artifacts so the result looks like the original site.
```bash
# Install
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
pip install -r config/requirements.txt

# Run
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
python3 -m wayback_archive.cli

# Preview
cd output && python3 -m http.server 8000
# Open http://localhost:8000
```

- Full website download -- HTML, CSS, JS, images, fonts, and all linked assets
- Recursive link discovery -- Automatically follows links in HTML, CSS, and JS files
- Smart URL rewriting -- Converts all links to relative paths for local serving
- Timeframe fallback -- Searches nearby Wayback Machine timestamps when a resource returns 404
- Real-time progress logging -- Displays download status and file processing as it happens
- Google Fonts support -- Downloads Google Fonts CSS and font files locally, fixing CORS issues
- Font corruption detection -- Identifies and removes corrupted font files (HTML error pages served as fonts)
- CDN fallback -- Automatic fallback to CDN for critical libraries (e.g., jQuery) when Wayback Machine fails
- Data attribute processing -- Processes `data-*` attributes containing URLs (videos, images, etc.)
- Icon group preservation -- Preserves all links in icon groups (social media, contact icons)
- Button link preservation -- Maintains styling and functionality of button links
- Cookie consent preservation -- Keeps cookie consent popups and functionality intact
- HTML minification -- Uses `minify-html` (Python 3.14+ compatible)
- JS/CSS minification -- Optional JavaScript and CSS minification via `rjsmin` and `cssmin`
- Image compression -- Optional image optimization with Pillow
- Tracker/ad removal -- Strips analytics, ads, and external iframes
- Link cleanup -- Configurable external link removal with anchor preservation options
- www/non-www normalization -- Normalize domain variations automatically
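The URL-rewriting step can be illustrated with a short sketch. This is not the tool's actual implementation, only an assumed mapping from a Wayback capture URL to a local relative path:

```python
from urllib.parse import urlparse
import posixpath

def wayback_to_local(wayback_url: str, current_page: str = "index.html") -> str:
    """Map a Wayback Machine capture URL to a local relative path.

    Illustrative only: the real downloader may use a different on-disk layout.
    """
    # Capture paths look like /web/<timestamp>/<original-url>
    path = urlparse(wayback_url).path
    original = path.split("/", 3)[3]  # e.g. http://example.com/css/main.css
    asset_path = urlparse(original).path.lstrip("/") or "index.html"
    # Resolve relative to the page that references the asset
    page_dir = posixpath.dirname(current_page) or "."
    return posixpath.relpath(asset_path, page_dir)

print(wayback_to_local(
    "https://web.archive.org/web/20250417203037/http://example.com/css/main.css"
))  # → css/main.css
```

A page in a subdirectory gets `../`-style paths, so the site works from any local root.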
| Capability | Wayback-Archive | wget | httrack |
|---|---|---|---|
| Wayback Machine URL rewriting | Yes | No | No |
| Wayback artifact cleanup | Yes | No | No |
| Timeframe fallback for 404s | Yes | No | No |
| Google Fonts localization | Yes | No | No |
| Font corruption detection | Yes | No | No |
| CDN fallback | Yes | No | No |
| HTML/CSS/JS minification | Yes | No | No |
| Tracker and ad removal | Yes | No | No |
| `data-*` attribute processing | Yes | No | No |
General-purpose tools like `wget --mirror` or `httrack` can download live websites, but they do not understand Wayback Machine URL structures, cannot clean up archive artifacts, and lack the specialized asset recovery that Wayback-Archive provides.
- Python 3.8 or higher
- pip
```bash
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive

# Optional: create a virtual environment
python3 -m venv venv
source venv/bin/activate  # macOS/Linux
# venv\Scripts\activate   # Windows

pip install -r config/requirements.txt
```

```bash
cd Wayback-Archive
pip install -e .

wayback-archive  # Available as a CLI command after installation
```

All options are set via environment variables. You can also use a `.env` file.
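To show how the flag-style options behave, here is a minimal sketch of reading them from the environment. The `env_bool` helper is an assumption for illustration, not the tool's actual `config.py`:

```python
import os

def env_bool(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment ("true"/"false", case-insensitive)."""
    return os.environ.get(name, str(default)).strip().lower() == "true"

# Examples using variables documented in the tables below
WAYBACK_URL = os.environ.get("WAYBACK_URL")            # required, no default
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "./output")  # optional, with default
OPTIMIZE_HTML = env_bool("OPTIMIZE_HTML", True)
```

Unset variables fall back to the documented defaults; any value other than `true` disables a flag.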
| Variable | Description |
|---|---|
| `WAYBACK_URL` | The Wayback Machine URL to download |
| Variable | Default | Description |
|---|---|---|
| `OUTPUT_DIR` | `./output` | Output directory for downloaded files |
| Variable | Default | Description |
|---|---|---|
| `OPTIMIZE_HTML` | `true` | Minify HTML |
| `OPTIMIZE_IMAGES` | `false` | Compress images |
| `MINIFY_JS` | `false` | Minify JavaScript |
| `MINIFY_CSS` | `false` | Minify CSS |
| Variable | Default | Description |
|---|---|---|
| `REMOVE_TRACKERS` | `true` | Remove analytics and trackers |
| `REMOVE_ADS` | `true` | Remove advertisements |
| `REMOVE_CLICKABLE_CONTACTS` | `true` | Remove `tel:` and `mailto:` links |
| `REMOVE_EXTERNAL_IFRAMES` | `false` | Remove external iframes |
| Variable | Default | Description |
|---|---|---|
| `REMOVE_EXTERNAL_LINKS_KEEP_ANCHORS` | `true` | Remove external links, keep anchor text |
| `REMOVE_EXTERNAL_LINKS_REMOVE_ANCHORS` | `false` | Remove external links and their anchor elements |
| `MAKE_INTERNAL_LINKS_RELATIVE` | `true` | Convert internal links to relative paths |
| Variable | Default | Description |
|---|---|---|
| `MAKE_NON_WWW` | `true` | Convert www URLs to non-www |
| `MAKE_WWW` | `false` | Convert non-www URLs to www |
| `KEEP_REDIRECTIONS` | `false` | Keep redirect pages |
| Variable | Default | Description |
|---|---|---|
| `MAX_FILES` | unlimited | Limit the number of files to download |
```bash
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export OUTPUT_DIR="./my_website"
export REMOVE_CLICKABLE_CONTACTS="false"  # Keep email/phone links
python3 -m wayback_archive.cli
```

```powershell
$env:WAYBACK_URL = "https://web.archive.org/web/20250417203037/http://example.com/"
$env:OUTPUT_DIR = ".\my_website"
$env:REMOVE_CLICKABLE_CONTACTS = "false"
python -m wayback_archive.cli
```

```cmd
set WAYBACK_URL=https://web.archive.org/web/20250417203037/http://example.com/
set OUTPUT_DIR=.\my_website
set REMOVE_CLICKABLE_CONTACTS=false
python -m wayback_archive.cli
```

Download a limited number of files to verify everything works:

```bash
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export MAX_FILES=5
python3 -m wayback_archive.cli
```

- Initial download -- Fetches the main page from the Wayback Machine
- Link extraction -- Parses HTML to find all referenced assets (links, images, CSS, JS)
- CSS processing -- Extracts font URLs, background images, and `@import` statements; downloads Google Fonts locally; detects corrupted font files
- JS processing -- Extracts dynamically loaded resources from JavaScript
- Data attributes -- Scans `data-*` attributes for additional asset URLs
- Iterative crawling -- Continues discovering and downloading resources until the queue is empty
- Timeframe fallback -- For 404 responses, searches nearby Wayback Machine timestamps
- URL rewriting -- Converts all URLs to relative paths for offline serving
- Preservation -- Maintains icon groups, button links, and cookie consent functionality
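The iterative crawl in the steps above can be modeled as a simple worklist algorithm. This is a simplified sketch, not the actual `downloader.py` code; `fetch` and `extract_links` are injected so the sketch stays self-contained:

```python
from collections import deque
from typing import Callable, Dict, Iterable, Optional

def crawl(start_url: str,
          fetch: Callable[[str], str],
          extract_links: Callable[[str], Iterable[str]],
          max_files: Optional[int] = None) -> Dict[str, str]:
    """Iteratively download resources until the discovery queue is empty."""
    queue = deque([start_url])
    seen = {start_url}
    downloaded: Dict[str, str] = {}
    while queue:
        if max_files is not None and len(downloaded) >= max_files:
            break  # honors a MAX_FILES-style cap
        url = queue.popleft()
        content = fetch(url)
        downloaded[url] = content
        # Each downloaded file may reveal new assets (HTML, CSS, JS discovery)
        for link in extract_links(content):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded
```

Because newly discovered links are only queued once, the loop terminates as soon as no unseen resources remain.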
```
Wayback-Archive/
    wayback_archive/          # Main package
        __init__.py
        __main__.py
        cli.py                # CLI entry point
        config.py             # Environment variable configuration
        downloader.py         # Core download and processing engine
    config/
        requirements.txt      # Runtime dependencies
        requirements-dev.txt  # Development dependencies
    setup.py                  # Package setup
    pytest.ini                # Test configuration
    tests/                    # Test suite
    docs/                     # Documentation
    LICENSE                   # GPL-3.0
    README.md
```
```bash
pip install -r config/requirements-dev.txt

# Run tests
pytest

# Run tests with coverage
pytest --cov=wayback_archive
```

```bash
python3 -m http.server 8080  # Use a different port
```

- Google Fonts: Downloaded automatically to avoid CORS issues
- Corrupted fonts: Detected and removed from CSS automatically
- Missing fonts: Some fonts may not exist in the Wayback Machine archive
See Font Loading Research Notes for details.
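Corrupted-font detection comes down to checking magic bytes: real WOFF/WOFF2/TTF/OTF files start with known signatures, while an error page served in a font's place starts with HTML. A hedged sketch of that check (the tool's actual heuristics may be more involved):

```python
# Standard font-format signatures (first bytes of the file)
FONT_SIGNATURES = (
    b"wOFF",               # WOFF
    b"wOF2",               # WOFF2
    b"\x00\x01\x00\x00",   # TrueType
    b"OTTO",               # OpenType with CFF outlines
)

def is_valid_font(data: bytes) -> bool:
    """Return True if the payload looks like a real font, not an HTML error page."""
    if data.lstrip()[:1] == b"<":  # HTML/XML served as a "font"
        return False
    return data.startswith(FONT_SIGNATURES)
```

Files that fail this check can be dropped and their `@font-face` rules removed from the downloaded CSS.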
- Icon groups (social media, contacts) are preserved automatically
- Button links with `sppb-btn` or `btn` classes are preserved
- Set `REMOVE_CLICKABLE_CONTACTS=false` to keep `tel:` and `mailto:` links
The tool includes automatic CDN fallback for critical libraries. If a file fails to download from the Wayback Machine, it will attempt to fetch it from a CDN.
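The fallback logic can be modeled as trying sources in order. A minimal sketch, where the CDN mapping and function names are illustrative rather than the tool's real table:

```python
from typing import Callable, Optional

# Illustrative mapping; the tool's actual CDN list may differ
CDN_FALLBACKS = {
    "jquery.min.js": "https://code.jquery.com/jquery-3.7.1.min.js",
}

def fetch_with_fallback(filename: str,
                        wayback_url: str,
                        fetch: Callable[[str], Optional[bytes]]) -> Optional[bytes]:
    """Try the Wayback Machine first, then a known CDN for critical libraries."""
    data = fetch(wayback_url)
    if data is not None:
        return data
    cdn_url = CDN_FALLBACKS.get(filename)
    return fetch(cdn_url) if cdn_url else None
```

Only files with a known CDN mirror get a second attempt; anything else simply fails as before.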
| Package | Purpose |
|---|---|
| requests | HTTP client |
| beautifulsoup4 | HTML parsing |
| lxml | Fast HTML/XML parser |
| minify-html | HTML minification |
| cssmin | CSS minification |
| rjsmin | JS minification |
| Pillow | Image optimization |
| python-dotenv | .env file support |
Contributions are welcome. Please feel free to submit a Pull Request.
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).