Add Tor (.onion) extraction support, UA library, UI controls and reachability checks #73
mattysparkles wants to merge 1 commit into main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25cfbbffb5
("Proxy routing", "proxy_scope", "Which targets may use Burp/ZAP/proxy_url or Tor routing."),
("Threads", "threads_scope", "Limit the extractor thread count to clear web, onion, or both."),
("Timeout", "timeout_scope", "Apply the configured extraction timeout to clear web, onion, or both."),
("Captcha solving", "captcha_scope", "Restrict CAPTCHA solving providers to the selected target type."),
Honor `captcha_scope` before calling CAPTCHA providers
This adds a user-facing scope control, but a repo-wide `rg -n "captcha_scope"` only finds config/UI references; the extraction path still calls `solve_captcha(soup, url)` unconditionally in `extract_login_form()`. If an operator sets CAPTCHA solving to clear web only to avoid leaking onion URLs or wasting solver credits, onion pages will still be sent to the configured provider, and vice versa.
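One way the gate could look is sketched below. This is a minimal illustration, not the repository's code: the scope values `clear_web_only`, `onion_only` and `both`, the body of `scope_applies`, and the wrapper name `maybe_solve_captcha` are assumptions; only `captcha_scope`, `scope_applies(scope, target)` and `solve_captcha(soup, url)` appear in the PR.

```python
from urllib.parse import urlparse


def is_onion_url(url: str) -> bool:
    """Return True when the URL's hostname ends in .onion."""
    host = urlparse(url).hostname or ""
    return host.lower().endswith(".onion")


def scope_applies(scope: str, url: str) -> bool:
    """Assumed semantics: 'onion_only', 'clear_web_only', anything else means both."""
    if scope == "onion_only":
        return is_onion_url(url)
    if scope == "clear_web_only":
        return not is_onion_url(url)
    return True


def maybe_solve_captcha(config: dict, soup, url: str, solve_captcha):
    """Hypothetical wrapper: call the solver only when captcha_scope covers the target."""
    if not scope_applies(config.get("captcha_scope", "both"), url):
        return None  # out of scope: the page is never sent to the provider
    return solve_captcha(soup, url)
```

With a wrapper like this, `extract_login_form()` would call `maybe_solve_captcha(config, soup, url, solve_captcha)` instead of calling the solver directly, so an out-of-scope URL never reaches the provider.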
error_info = {"status": "failed", "error_message": "unknown"}
current_proxy = self.proxy_manager.get_proxy() if self.proxy_manager else proxy
current_proxy = self.proxy_manager.get_proxy() if (self.proxy_manager and scope_applies(config.get("proxy_scope", "both"), base)) else proxy
Bypass fixed proxies when `proxy_scope` does not match
This only scopes `ProxyManager` output. When rotation is off, `current_proxy` falls back to the already-resolved proxy from `get_intercept_proxy(config, proxy_candidate)`, and the fetchers also re-read proxy settings from config, so selecting `clear_web_only` or `onion_only` never actually disables `proxy_url`/Burp/ZAP for the opposite target class. Users relying on the new scope to keep one class of traffic off their intercept proxy will still route it there.
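A possible shape for the fix is to decide the scope first and return no proxy at all when it does not match, whether the proxy comes from rotation or from the fixed setting. The sketch below assumes the same scope semantics as the diff's `scope_applies(scope, target)` call; the function name `select_proxy` and its parameters are hypothetical.

```python
from urllib.parse import urlparse


def scope_applies(scope: str, url: str) -> bool:
    """Assumed semantics: 'onion_only', 'clear_web_only', anything else means both."""
    is_onion = (urlparse(url).hostname or "").lower().endswith(".onion")
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True


def select_proxy(config: dict, target_url: str, fixed_proxy, proxy_manager=None):
    """Hypothetical helper: pick a proxy only for targets inside proxy_scope."""
    if not scope_applies(config.get("proxy_scope", "both"), target_url):
        return None  # out of scope: neither the rotating nor the fixed proxy is used
    if proxy_manager is not None:
        return proxy_manager.get_proxy()
    return fixed_proxy
```

This only covers the selection point. The fetchers would also need to use the value passed to them rather than re-reading proxy settings from config, otherwise the scope is bypassed again downstream.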
timeout_seconds = max(300, int(config.get("extract_site_timeout_seconds", 300) or 300)) # FIXED: 5-minute per-site timeout
executor = ThreadPoolExecutor(max_workers=self.threads.get())
timeout_seconds = max(45, int(config.get("extract_site_timeout_seconds", 300) or 300)) # FIXED: per-site timeout with Tor floor
Apply `timeout_scope` before enforcing per-site timeouts
The new timeout scope is not consulted here: a repo-wide `rg -n "timeout_scope"` only finds the onion fetch branch in `extract.py`, but this scheduler still turns `extract_site_timeout_seconds` into a single global timeout for every future. If a user sets timeout scope to `onion_only`, slow clear-web sites will still be cancelled and recorded as `extract_timeout`, contrary to the setting they just saved.
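The scheduler could compute the timeout per site rather than once globally. The following is a rough sketch under assumed scope semantics; `effective_timeout` is a hypothetical name, and treating an out-of-scope site as "no limit" is one possible policy, not necessarily what the author intends.

```python
from urllib.parse import urlparse


def scope_applies(scope: str, url: str) -> bool:
    """Assumed semantics: 'onion_only', 'clear_web_only', anything else means both."""
    is_onion = (urlparse(url).hostname or "").lower().endswith(".onion")
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True


def effective_timeout(config: dict, url: str, floor: int = 45):
    """Hypothetical helper: per-site timeout in seconds, or None when out of scope."""
    if not scope_applies(config.get("timeout_scope", "both"), url):
        return None  # Future.result(timeout=None) waits without a limit
    configured = int(config.get("extract_site_timeout_seconds", 300) or 300)
    return max(floor, configured)  # keep the 45-second floor from the diff
```

Each future would then be awaited with `future.result(timeout=effective_timeout(config, site_url))`, so with `onion_only` selected a slow clear-web site is no longer cancelled by the onion timeout.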
Motivation
Description
- `config.py`: new keys (e.g. `enable_onion_processing`, `use_nordvpn_onion_only`, `user_agent_library`, scope keys like `proxy_scope` etc.) and default values saved/loaded from `DATA_DIR/config.json`.
- `helpers.py`: a 30+ realistic 2025–2026 `USER_AGENTS` library, `resolve_user_agent` and scope logic, `.onion` detection/normalization, Tor SOCKS constants, Tor process start helpers, and `classify_onion_reachability`, which probes `.onion` URLs over `socks5h://127.0.0.1:9050` with a 45s timeout and returns `live|seized/down|tor_error`.
- `tor_fetch.py`: performs Playwright-over-SOCKS5 fetches (preferred) with a fallback to `requests` via Tor, returning the used UA and error payloads.
- `fetch.py`: Playwright, Selenium and `requests` use `resolve_user_agent(config, target_url=...)` (supports random-per-request vs selected/custom UA and scope enforcement).
- `extract.py`: enforces `.onion` enablement, runs the onion reachability check prior to extraction, uses the `tor_fetch` path for onion HTML retrieval, and includes the UA used in the returned `result` and logs.
- `gui.py`: a collapsible "Tor / Onion Support" section, tooltips, an `Enable .onion processing` checkbox, a `Use NordVPN Onion-over-VPN` checkbox (enabled only if the NordVPN CLI and Onion group are detected), a `Random User-Agent` checkbox, a combobox preloaded from the UA library, `Import User-Agent list from TXT` and `Add Custom User-Agent` dialogs, and per-setting radio groups (`Clear Web only`/`Onion only`/`Both`) persisted to config.
- `gui.py` and `main.py`: skip onion URLs when disabled, prompt/warn and optionally start Tor if `.onion` processing is enabled but Tor is not running, store per-site `onion_status` in `processed_sites.json`, and (when configured) temporarily switch NordVPN to its `Onion` group before extracting onion sites and restore it afterwards.
- `install_tools.py`: exposes an installer checkbox for Tor Python deps; `logging.py`: privacy logging now redacts onion domains to `genericonionexample.onion` in the privacy log.
Testing
- `python -m py_compile gui.py config.py extract.py fetch.py helpers.py install_tools.py main.py tor_fetch.py logging.py`: compilation check completed successfully.
- `load_config()` + `get_user_agent_library()` + `is_onion_url()` + `resolve_user_agent(...)` + `classify_onion_reachability('http://exampleexampleexample.onion')`: executed and returned expected values, including `tor_error` when Tor is not running.
- `extract_login_form('http://exampleexampleexample.onion', mode='playwright')`: executed and returned the expected skipped/tor error payload when `.onion` processing is disabled or Tor is not running.
- `import gui, extract, fetch, tor_fetch, helpers, install_tools, main`: successful import with new modules.
- `detect_tor_installation()` and `check_nordvpn_onion_support()` executed successfully in this environment (detection results reported).
Files changed (high level):
`gui.py`, `config.py`, `helpers.py`, `fetch.py`, `extract.py`, `install_tools.py`, `main.py`, `logging.py` and new `tor_fetch.py` (see commit for full diffs).
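For readers who want a feel for the privacy-log redaction mentioned in the description, here is a small sketch of how onion hostnames might be rewritten to `genericonionexample.onion`. The regular expression and the function name `redact_onion_domains` are assumptions for illustration; the actual logic in `logging.py` may differ.

```python
import re

# Assumed pattern: one or more dot-separated labels followed by the .onion TLD.
ONION_HOST_RE = re.compile(r"(?:[a-z0-9-]+\.)+onion\b", re.IGNORECASE)

REDACTED_ONION = "genericonionexample.onion"  # placeholder named in the PR description


def redact_onion_domains(message: str) -> str:
    """Replace every .onion hostname in a log message with the generic placeholder."""
    return ONION_HOST_RE.sub(REDACTED_ONION, message)
```

Paths and clear-web hosts are left untouched, so a line such as `GET http://exampleexampleexample.onion/login failed` keeps its structure while the hostname is replaced.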