Wayback When is a tool that crawls a website and saves its pages to the Internet Archive’s Wayback Machine. It uses a headless browser to load pages the same way a real visitor would, so it can find links that only appear after scripts run. As it crawls, it keeps track of every internal link it discovers. Before archiving anything, it checks when the page was last saved. If the page was archived recently, it skips it. If it hasn’t been saved in a while, it sends it to the Wayback Machine. The goal is to make website preservation easier, faster, and less repetitive. Instead of manually checking pages or wasting time on duplicates, Wayback When handles the crawling, the decision‑making, and the archiving for you.
Wayback When uses a Selenium‑based scraper to explore a website and collect every link it can find. Instead of parsing only the raw HTML, it loads each page in a full browser environment, just like a real visitor. This lets it discover links that only appear after scripts run, and because its requests come from a real browser it trips fewer anti‑scraping protections than a plain HTTP scraper (when a CAPTCHA does appear, it detects it and waits before continuing).
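For illustration, the link‑discovery step might look roughly like the sketch below. This is not the project's actual code: the browser options, timeout value, and fragment handling are assumptions.

```python
# Rough sketch of headless link discovery with Selenium; not the project's actual code.
from urllib.parse import urljoin, urlparse

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def discover_links(base_url: str) -> set[str]:
    """Render base_url in a headless browser and return the internal links found."""
    options = Options()
    options.add_argument("--headless=new")  # assumed flag; depends on the Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(30)  # illustrative timeout
        driver.get(base_url)

        links = set()
        domain = urlparse(base_url).netloc
        for anchor in driver.find_elements(By.TAG_NAME, "a"):
            href = anchor.get_attribute("href")
            if not href:
                continue
            absolute = urljoin(base_url, href)
            # Keep only internal links on the same host, dropping URL fragments.
            if urlparse(absolute).netloc == domain:
                links.add(absolute.split("#")[0])
        return links
    finally:
        driver.quit()
```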
The archiver decides which pages actually need to be saved. For every link the scraper finds, it checks the Wayback Machine to see when the page was last archived. If the snapshot is recent, it skips it. If it’s old or missing, it sends a new save request. It also handles rate limits and retries so the process can run for long periods without manual supervision.
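As a rough sketch of that decision logic, the example below checks the Wayback Machine's public availability API and, only when the latest snapshot is missing or stale, requests a new capture via the Save Page Now endpoint. The freshness threshold, retry count, and back‑off times are illustrative assumptions, not the project's actual settings.

```python
# Minimal sketch of the "archive only if stale" check; thresholds and retry
# behaviour here are illustrative assumptions, not the project's real settings.
import time
from datetime import datetime, timedelta, timezone

import requests

FRESHNESS_THRESHOLD = timedelta(days=30)  # assumed: skip pages saved within the last 30 days


def last_snapshot_time(url: str) -> datetime | None:
    """Return the timestamp of the most recent Wayback snapshot, or None if there is none."""
    resp = requests.get("https://archive.org/wayback/available", params={"url": url}, timeout=30)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest:
        return None
    return datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def archive_if_stale(url: str, retries: int = 5) -> None:
    """Request a new snapshot unless a recent one already exists."""
    snapshot = last_snapshot_time(url)
    if snapshot and datetime.now(timezone.utc) - snapshot < FRESHNESS_THRESHOLD:
        return  # recent enough, skip

    for attempt in range(retries):
        # Save Page Now; an unauthenticated GET may be rate-limited.
        resp = requests.get(f"https://web.archive.org/save/{url}", timeout=300)
        if resp.ok:
            return
        if resp.status_code == 429:  # rate limited: back off longer on each attempt
            time.sleep(60 * (attempt + 1))
        else:
            time.sleep(10)
```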
- Added a `max_crawl_runtime` setting
- Added a `max_archive_runtime` setting
- `SETTINGS` have been sorted alphabetically
- Increased `retries` from 3 → 5
- Reduced `archive_timeout_seconds` from 1200s → 300s
- Removed the deprecated `max_archiving_queue_size` setting
- Added Runtime to Archiving Summary
- Added Progress Counter to the Archival Messages
- Added new global runtime limits: `max_crawl_runtime` and `max_archive_runtime`
- Added `logging` import for future structured logging support
- Hid `urllib3` Error Messages behind `DEBUG_MODE`
- Hid `WebDriver` Error Messages behind `DEBUG_MODE`
- Hid "Attempting to continue after automated wait..." behind `DEBUG_MODE`
- Hid "Failed to retrieve `{base_url}` after `{retries}` attempts." behind `DEBUG_MODE`
- Hid "CAPTCHA DETECTED for `{base_url}`. Waiting 5–10 seconds..." behind `DEBUG_MODE`
- Archiving errors now fall under the `retry` variable
- Improved timeout handling for archiving threads (now retries instead of immediate failure)
- Fixed issue where "Finished processing `{base_url}`. Discovered `{len(links)}` links." would be shown as `DEBUG` instead of `INFO`
- Fixed issue where URL Normalisation would add `HTTP://` to FTP and RSYNC URLs, causing scraping issues
- Fixed archiving timeout logic so that a timeout no longer permanently blocks the thread
- Fixed inconsistent logging levels between crawler and archiver subsystems
- Changed message from `Adding initial URL to queues:` to `Starting with URLs:`
- Deprecated `max_archiving_queue_size` (now fully removed)
- Minor internal refactors for clarity and consistency
- Improved internal comments and documentation for maintainability
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.