GitHub - GeiserX/Wayback-Archive: Download complete websites from the Wayback Machine with full asset preservation for offline viewing

Download complete websites from the Wayback Machine for offline viewing.

Wayback-Archive is a Python tool that downloads archived websites from the Wayback Machine and reconstructs them for fully functional offline viewing. It preserves all assets -- HTML, CSS, JavaScript, images, and fonts -- rewrites URLs to relative paths, and cleans up Wayback Machine artifacts so the result looks like the original site.

Quick Start

# Install
git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive
pip install -r config/requirements.txt

# Run
export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
python3 -m wayback_archive.cli

# Preview
cd output && python3 -m http.server 8000
# Open http://localhost:8000

Features

Core

Full website download -- HTML, CSS, JS, images, fonts, and all linked assets
Recursive link discovery -- Automatically follows links in HTML, CSS, and JS files
Smart URL rewriting -- Converts all links to relative paths for local serving
Timeframe fallback -- Searches nearby Wayback Machine timestamps when a resource returns 404
Real-time progress logging -- Displays download status and file processing as it happens
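
As a rough illustration of the timeframe fallback idea, the sketch below builds Wayback Machine URLs for timestamps around one that returned 404. It is not the tool's actual implementation: the function name nearby_snapshot_urls and the day offsets are assumptions made for this example, and the real search strategy may differ.

```python
from datetime import datetime, timedelta

WAYBACK_PREFIX = "https://web.archive.org/web/"
TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit Wayback timestamp, e.g. 20250417203037


def nearby_snapshot_urls(timestamp, original_url, day_offsets=(1, 7, 30)):
    """Return candidate Wayback URLs before and after the given timestamp.

    Hypothetical helper: the offsets are illustrative, not the tool's values.
    """
    base = datetime.strptime(timestamp, TS_FORMAT)
    urls = []
    for days in day_offsets:
        # Try the earlier snapshot first, then the later one.
        for delta in (timedelta(days=-days), timedelta(days=days)):
            candidate = (base + delta).strftime(TS_FORMAT)
            urls.append(f"{WAYBACK_PREFIX}{candidate}/{original_url}")
    return urls
```

A caller would request each candidate in order and stop at the first one that does not return 404.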

Asset Handling

Google Fonts support -- Downloads Google Fonts CSS and font files locally, fixing CORS issues
Font corruption detection -- Identifies and removes corrupted font files (HTML error pages served as fonts)
CDN fallback -- Falls back to a CDN for critical libraries (e.g., jQuery) when the copy in the Wayback Machine cannot be downloaded
Data attribute processing -- Processes data-* attributes containing URLs (videos, images, etc.)
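
One plausible way to detect a corrupted font (an HTML error page saved with a font extension) is to check the first bytes of the file against known font signatures. The sketch below assumes that approach; the helper names are hypothetical, the tool may use a different check, and EOT files are not covered because their signature is not at the start of the file.

```python
# Leading 4-byte signatures of common web font containers.
FONT_SIGNATURES = (
    b"wOFF",              # WOFF 1.0
    b"wOF2",              # WOFF 2.0
    b"\x00\x01\x00\x00",  # TrueType outlines
    b"OTTO",              # OpenType with CFF outlines
    b"true",              # legacy Apple TrueType
    b"ttcf",              # TrueType/OpenType collection
)


def looks_like_font(data: bytes) -> bool:
    """Return True when the payload starts with a known font signature."""
    return data[:4] in FONT_SIGNATURES


def looks_like_html(data: bytes) -> bool:
    """Return True when the payload starts like an HTML document."""
    head = data[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

A downloaded file with a font extension that fails looks_like_font, or passes looks_like_html, would be a candidate for removal from the CSS.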

Preservation

Icon group preservation -- Preserves all links in icon groups (social media, contact icons)
Button link preservation -- Maintains styling and functionality of button links
Cookie consent preservation -- Keeps cookie consent popups and functionality intact

Optimization

HTML minification -- Uses minify-html (Python 3.14+ compatible)
JS/CSS minification -- Optional JavaScript and CSS minification via rjsmin and cssmin
Image compression -- Optional image optimization with Pillow
Tracker/ad removal -- Strips analytics and ads by default, and external iframes when REMOVE_EXTERNAL_IFRAMES is enabled
Link cleanup -- Configurable external link removal with anchor preservation options
www/non-www normalization -- Normalizes www and non-www variants of the domain to a single form automatically
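
The www/non-www normalization can be pictured with the standard library alone. This is a minimal sketch, not the tool's code: normalize_host is a hypothetical name, its make_www parameter loosely mirrors the MAKE_WWW and MAKE_NON_WWW options, and user credentials in the URL are dropped for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_host(url: str, make_www: bool = False) -> str:
    """Rewrite the host to its bare form, or to its www form if make_www is set."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host:
        return url  # relative URL or no host: nothing to normalize
    bare = host[4:] if host.startswith("www.") else host
    new_host = f"www.{bare}" if make_www else bare
    # Keep an explicit port if the URL had one.
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```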

Why Wayback-Archive?

Capability	Wayback-Archive	wget	httrack
Wayback Machine URL rewriting	Yes	No	No
Wayback artifact cleanup	Yes	No	No
Timeframe fallback for 404s	Yes	No	No
Google Fonts localization	Yes	No	No
Font corruption detection	Yes	No	No
CDN fallback	Yes	No	No
HTML/CSS/JS minification	Yes	No	No
Tracker and ad removal	Yes	No	No
`data-*` attribute processing	Yes	No	No

General-purpose tools like wget --mirror or httrack can download live websites, but they do not understand Wayback Machine URL structures, cannot clean up archive artifacts, and lack the specialized asset recovery that Wayback-Archive provides.
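
To make the "Wayback Machine URL structure" concrete: an archived URL embeds a timestamp and the original address after the /web/ prefix, optionally with a short modifier such as cs_ or im_ after the timestamp. The sketch below splits such a URL with a regular expression. The function name split_wayback_url and the exact pattern are assumptions for illustration, not the parser the tool uses.

```python
import re

# https://web.archive.org/web/<timestamp>[<modifier>_]/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<timestamp>\d{1,14})(?P<modifier>[a-z]{2}_)?/"
    r"(?P<original>.+)$"
)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a Wayback URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return match.group("timestamp"), match.group("original")
```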

Installation

Prerequisites

Python 3.8 or higher
pip

From Source

git clone https://github.com/GeiserX/Wayback-Archive.git
cd Wayback-Archive

# Optional: create a virtual environment
python3 -m venv venv
source venv/bin/activate  # macOS/Linux
# venv\Scripts\activate   # Windows

pip install -r config/requirements.txt

As a Package

cd Wayback-Archive
pip install -e .
wayback-archive  # Available as a CLI command after installation

Configuration

All options are set through environment variables, either exported in the shell or listed in a .env file.

Required

Variable	Description
`WAYBACK_URL`	The Wayback Machine URL to download

Output

Variable	Default	Description
`OUTPUT_DIR`	`./output`	Output directory for downloaded files

Optimization

Variable	Default	Description
`OPTIMIZE_HTML`	`true`	Minify HTML
`OPTIMIZE_IMAGES`	`false`	Compress images
`MINIFY_JS`	`false`	Minify JavaScript
`MINIFY_CSS`	`false`	Minify CSS

Content Removal

Variable	Default	Description
`REMOVE_TRACKERS`	`true`	Remove analytics and trackers
`REMOVE_ADS`	`true`	Remove advertisements
`REMOVE_CLICKABLE_CONTACTS`	`true`	Remove `tel:` and `mailto:` links
`REMOVE_EXTERNAL_IFRAMES`	`false`	Remove external iframes

Link Handling

Variable	Default	Description
`REMOVE_EXTERNAL_LINKS_KEEP_ANCHORS`	`true`	Remove external links, keep anchor text
`REMOVE_EXTERNAL_LINKS_REMOVE_ANCHORS`	`false`	Remove external links and anchor elements
`MAKE_INTERNAL_LINKS_RELATIVE`	`true`	Convert internal links to relative paths
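
Converting an internal link to a relative path amounts to computing the path from the directory of the referring page to the target resource. The following sketch shows that calculation with posixpath. The name relative_link is hypothetical, query strings and fragments are ignored, and the mapping of directory URLs to files such as index.html is left out, so treat it as an illustration rather than the tool's rewriting logic.

```python
import posixpath
from urllib.parse import urlsplit


def relative_link(from_page: str, to_resource: str) -> str:
    """Return the path to to_resource relative to the directory of from_page."""
    from_path = urlsplit(from_page).path or "/"
    to_path = urlsplit(to_resource).path or "/"
    # A path ending in "/" is already a directory; otherwise use its parent.
    from_dir = from_path if from_path.endswith("/") else posixpath.dirname(from_path)
    return posixpath.relpath(to_path, start=from_dir)
```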

Domain

Variable	Default	Description
`MAKE_NON_WWW`	`true`	Convert www to non-www
`MAKE_WWW`	`false`	Convert non-www to www
`KEEP_REDIRECTIONS`	`false`	Keep redirect pages

Testing

Variable	Default	Description
`MAX_FILES`	unlimited	Limit number of files to download
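
The tables above describe boolean and numeric options supplied as environment strings. A minimal sketch of how such values might be read is shown below. The helpers env_flag and env_int and the set of accepted "true" spellings are assumptions for this example; the README itself only uses the literal values true and false.

```python
import os

# Spellings treated as "enabled" in this sketch (an assumption, not documented).
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name, default, environ=os.environ):
    """Read a boolean option, falling back to default when unset or empty."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUE_VALUES


def env_int(name, default=None, environ=os.environ):
    """Read an integer option such as MAX_FILES; None means no limit here."""
    raw = environ.get(name)
    return int(raw) if raw and raw.strip() else default
```

Passing a plain dict as environ makes the helpers easy to exercise without touching the real environment.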

Usage

macOS / Linux

export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export OUTPUT_DIR="./my_website"
export REMOVE_CLICKABLE_CONTACTS="false"  # Keep email/phone links

python3 -m wayback_archive.cli

Windows (PowerShell)

$env:WAYBACK_URL = "https://web.archive.org/web/20250417203037/http://example.com/"
$env:OUTPUT_DIR = ".\my_website"
$env:REMOVE_CLICKABLE_CONTACTS = "false"

python -m wayback_archive.cli

Windows (CMD)

set WAYBACK_URL=https://web.archive.org/web/20250417203037/http://example.com/
set OUTPUT_DIR=.\my_website
set REMOVE_CLICKABLE_CONTACTS=false

python -m wayback_archive.cli

Quick Test

Download a limited number of files to verify everything works:

export WAYBACK_URL="https://web.archive.org/web/20250417203037/http://example.com/"
export MAX_FILES=5
python3 -m wayback_archive.cli

How It Works

Initial download -- Fetches the main page from the Wayback Machine
Link extraction -- Parses HTML to find all referenced assets (links, images, CSS, JS)
CSS processing -- Extracts font URLs, background images, and @import statements; downloads Google Fonts locally; detects corrupted font files
JS processing -- Extracts dynamically loaded resources from JavaScript
Data attributes -- Scans data-* attributes for additional asset URLs
Iterative crawling -- Continues discovering and downloading resources until the queue is empty
Timeframe fallback -- For 404 responses, searches nearby Wayback Machine timestamps
URL rewriting -- Converts all URLs to relative paths for offline serving
Preservation -- Maintains icon groups, button links, and cookie consent functionality
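
The iterative crawling step can be sketched as a queue that is drained until no new URLs are discovered. The code below is a simplified illustration under stated assumptions: crawl, fetch, and extract_links are hypothetical names, fetching and link extraction are injected as plain functions, and the max_files argument loosely mirrors the MAX_FILES option. The real downloader does considerably more per file.

```python
from collections import deque


def crawl(start_url, fetch, extract_links, max_files=None):
    """Breadth-first download starting at start_url.

    fetch(url) returns the body, or None when the download failed.
    extract_links(url, body) returns the URLs referenced by that body.
    """
    queue = deque([start_url])
    seen = {start_url}   # URLs already queued, to avoid duplicates and loops
    downloaded = {}      # url -> body, in download order
    while queue:
        if max_files is not None and len(downloaded) >= max_files:
            break
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue     # failed download: skip, nothing to extract
        downloaded[url] = body
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded
```

Tracking queued URLs in a set is what guarantees termination when pages link back to each other.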

Project Structure

Wayback-Archive/
  wayback_archive/          # Main package
    __init__.py
    __main__.py
    cli.py                  # CLI entry point
    config.py               # Environment variable configuration
    downloader.py           # Core download and processing engine
  config/
    requirements.txt        # Runtime dependencies
    requirements-dev.txt    # Development dependencies
    setup.py                # Package setup
    pytest.ini              # Test configuration
  tests/                    # Test suite
  docs/                     # Documentation
  LICENSE                   # GPL-3.0
  README.md

Testing

pip install -r config/requirements-dev.txt

# Run tests
pytest

# Run tests with coverage
pytest --cov=wayback_archive

Troubleshooting

Port Already in Use

python3 -m http.server 8080  # Use a different port

Font Loading Issues

Google Fonts: Downloaded automatically to avoid CORS issues
Corrupted fonts: Detected and removed from CSS automatically
Missing fonts: Some fonts may not exist in the Wayback Machine archive

See Font Loading Research Notes for details.

Missing Links or Icons

Icon groups (social media, contacts) are preserved automatically
Button links with sppb-btn or btn classes are preserved
Set REMOVE_CLICKABLE_CONTACTS=false to keep tel: and mailto: links

jQuery or Libraries Not Loading

The tool includes automatic CDN fallback for critical libraries. If such a file fails to download from the Wayback Machine, the tool attempts to fetch it from a CDN instead.
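
As an illustration of the CDN fallback idea, the sketch below maps a versioned jQuery filename to the matching file on code.jquery.com. Which libraries and which CDN the tool actually uses is not stated here, so the function name cdn_fallback_url, the filename pattern, and the choice of CDN are all assumptions.

```python
import re

# Matches names like jquery-3.6.0.js or jquery-3.6.0.min.js (three-part versions only).
JQUERY_RE = re.compile(
    r"jquery[-.](?P<version>\d+\.\d+\.\d+)(?P<min>\.min)?\.js$", re.IGNORECASE
)


def cdn_fallback_url(original_url):
    """Return a code.jquery.com URL for a versioned jQuery file, else None."""
    path = original_url.split("?", 1)[0]  # ignore any query string
    match = JQUERY_RE.search(path)
    if match is None:
        return None
    suffix = ".min" if match.group("min") else ""
    return f"https://code.jquery.com/jquery-{match.group('version')}{suffix}.js"
```

Files without a recognizable version in their name return None in this sketch, since guessing a version could load an incompatible library.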

Dependencies

Package	Purpose
requests	HTTP client
beautifulsoup4	HTML parsing
lxml	Fast HTML/XML parser
minify-html	HTML minification
cssmin	CSS minification
rjsmin	JS minification
Pillow	Image optimization
python-dotenv	`.env` file support

Contributing

Contributions are welcome. Please feel free to submit a Pull Request.

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
