Wayback When

Wayback When is a tool that crawls a website and saves its pages to the Internet Archive’s Wayback Machine. It uses a headless browser to load pages the same way a real visitor would, so it can find links that only appear after scripts run. As it crawls, it keeps track of every internal link it discovers. Before archiving anything, it checks when each page was last saved: recently archived pages are skipped, while pages with an old or missing snapshot are submitted to the Wayback Machine for a fresh save. The goal is to make website preservation easier, faster, and less repetitive. Instead of manually checking pages or wasting time on duplicates, Wayback When handles the crawling, the decision-making, and the archiving for you.

Scraper

Wayback When uses a Selenium-based scraper to explore a website and collect every link it can find. Instead of parsing only the raw HTML, it loads each page in a full browser environment, just like a real visitor. This lets it discover links that only appear after JavaScript runs, and its browser-like traffic is far less likely to trip anti-scraping protections than a bare HTTP client.
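
The project’s scraper code isn’t reproduced here, but a minimal sketch of the idea, assuming Python, Selenium 4, and a Chrome driver, might look like this (the function name and details are illustrative, not the project’s API):

```python
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def collect_internal_links(base_url: str) -> set[str]:
    """Load a page in a headless browser and return same-domain links."""
    options = Options()
    options.add_argument("--headless=new")  # run without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(base_url)  # scripts execute, so JS-injected links appear
        links = set()
        for anchor in driver.find_elements(By.TAG_NAME, "a"):
            href = anchor.get_attribute("href")
            # Keep only internal links (same host as the starting page).
            if href and urlparse(href).netloc == urlparse(base_url).netloc:
                links.add(urljoin(base_url, href))
        return links
    finally:
        driver.quit()
```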

Archiver

The archiver decides which pages actually need to be saved. For every link the scraper finds, it checks the Wayback Machine to see when the page was last archived. If the snapshot is recent, it skips it. If it’s old or missing, it sends a new save request. It also handles rate limits and retries so the process can run for long periods without manual supervision.
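
As a rough sketch of that check-then-save decision (not the project’s actual code), using the Wayback Machine’s public availability API and save endpoint, with a placeholder freshness threshold:

```python
from datetime import datetime, timedelta, timezone

import requests

MAX_SNAPSHOT_AGE = timedelta(days=30)  # hypothetical threshold, not the project's

def archive_if_stale(url: str) -> None:
    """Save `url` to the Wayback Machine unless a recent snapshot exists."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=30)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if closest:
        taken = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
        taken = taken.replace(tzinfo=timezone.utc)
        if datetime.now(timezone.utc) - taken < MAX_SNAPSHOT_AGE:
            return  # a recent snapshot exists; skip this page
    # Snapshot is old or missing: request a fresh save.
    requests.get(f"https://web.archive.org/save/{url}", timeout=300)
```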

V1.2 Release

New Additions and Enhancements in V1.2

Settings

  • Added a max_crawl_runtime setting
  • Added a max_archive_runtime setting
  • Settings are now sorted alphabetically
  • Increased retries from 3 → 5
  • Reduced archive_timeout_seconds from 1200 → 300 seconds
  • Removed the deprecated max_archiving_queue_size setting (see the settings sketch below)
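
For illustration, the V1.2 settings might look like the hypothetical snippet below; the retries and archive_timeout_seconds values come from the notes above, while the runtime-limit values are placeholders, since the release notes don’t state them:

```python
# Hypothetical illustration of the V1.2 SETTINGS (sorted alphabetically);
# only the retries and archive_timeout_seconds values come from the notes.
SETTINGS = {
    "archive_timeout_seconds": 300,  # reduced from 1200 in V1.2
    "max_archive_runtime": 3600,     # new in V1.2; placeholder value
    "max_crawl_runtime": 3600,       # new in V1.2; placeholder value
    "retries": 5,                    # increased from 3 in V1.2
    # max_archiving_queue_size was removed in V1.2
}
```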

Added Features

  • Added runtime to the archiving summary
  • Added a progress counter to archival messages
  • Added new global runtime limits: max_crawl_runtime and max_archive_runtime (see Settings)
  • Added a logging import in preparation for structured logging support

Error Handling

  • Hid urllib3 Error Messages behind DEBUG_MODE
  • Hid WebDriver Error Messages behind DEBUG_MODE
  • Hid "Attempting to continue after automated wait..." behind DEBUG_MODE
  • Hid "Failed to retrieve {base_url} after {retries} attempts." behind DEBUG_MODE
  • Hid "CAPTCHA DETECTED for {base_url}. Waiting 5–10 seconds..." behind DEBUG_MODE
  • Archiving errors are now governed by the retries setting
  • Improved timeout handling for archiving threads: a timeout now triggers a retry instead of an immediate failure (see the sketch below)
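
A rough illustration of that retry-on-timeout behaviour (a sketch, not the project’s code; the function name and endpoint usage are assumptions):

```python
import logging

import requests

DEBUG_MODE = False  # low-level errors stay hidden unless debugging

def save_with_retries(url: str, retries: int = 5,
                      timeout_seconds: int = 300) -> bool:
    """Retry a Wayback save request; timeouts count against `retries`."""
    for attempt in range(1, retries + 1):
        try:
            requests.get(f"https://web.archive.org/save/{url}",
                         timeout=timeout_seconds)
            return True
        except requests.RequestException as exc:
            # A timeout no longer fails immediately; it consumes a retry.
            if DEBUG_MODE:
                logging.debug("Attempt %d/%d for %s failed: %s",
                              attempt, retries, url, exc)
    return False
```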

Bug Fixes

  • Fixed issue where "Finished processing {base_url}. Discovered {len(links)} links." was logged at DEBUG instead of INFO
  • Fixed issue where URL normalisation would prepend http:// to ftp:// and rsync:// URLs, causing scraping issues (see the sketch after this list)
  • Fixed the archiving timeout logic so that a timeout no longer permanently blocks the thread
  • Fixed inconsistent logging levels between the crawler and archiver subsystems
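
The normalisation fix might look something like this sketch (illustrative only; normalise_url is a hypothetical name): a default scheme is added only when the URL has none, so ftp:// and rsync:// links pass through untouched.

```python
from urllib.parse import urlparse

def normalise_url(url: str) -> str:
    # Prepend a default scheme only when the URL has none;
    # ftp:// and rsync:// URLs are left as-is (the V1.2 fix).
    if not urlparse(url).scheme:
        return "http://" + url
    return url

assert normalise_url("example.com/page") == "http://example.com/page"
assert normalise_url("ftp://example.com/file") == "ftp://example.com/file"
```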

Miscellaneous

  • Changed the startup message from "Adding initial URL to queues:" to "Starting with URLs:"
  • Minor internal refactors for clarity and consistency
  • Improved internal comments and documentation for maintainability

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
