
Add Tor (.onion) extraction support, UA library, UI controls and reachability checks#73

Open
mattysparkles wants to merge 1 commit into main from codex/add-tor-support-to-parserpro

Conversation

@mattysparkles (Owner)

Motivation

  • Enable first-class handling of .onion targets including reachability checks and extraction via Tor/SOCKS5 while preserving the existing clear-web flow.
  • Provide per-target user-agent control and per-setting scope toggles so extraction behavior (proxy, threads, timeout, captcha, validation, UA rotation) can be applied to Clear Web, Onion-only, or Both.
  • Improve privacy-aware logging for onion domains and offer runtime helpers to start Tor or leverage NordVPN Onion group when available.

Description

  • Add new persisted config keys in config.py (e.g. enable_onion_processing, use_nordvpn_onion_only, user_agent_library, and scope keys such as proxy_scope), with default values saved to and loaded from DATA_DIR/config.json.
  • Implement helpers in helpers.py: a USER_AGENTS library of 30+ realistic 2025–2026 user agents, resolve_user_agent with its scope logic, .onion detection/normalization, Tor SOCKS constants, Tor process start helpers, and classify_onion_reachability, which probes .onion URLs over socks5h://127.0.0.1:9050 with a 45 s timeout and returns live, seized/down, or tor_error.
  • Add a dedicated tor_fetch.py that performs Playwright-over-SOCKS5 fetches (preferred) with a fallback to requests via Tor, returning the used UA and error payloads.
  • Wire UA selection/resolution into fetch.py so Playwright, Selenium and requests use resolve_user_agent(config, target_url=...) (supports random-per-request vs selected/custom UA and scope enforcement).
  • Update extract.py to: enforce .onion enablement, run the onion reachability check prior to extraction, use the tor_fetch path for onion HTML retrieval, and include the UA used in the returned result and logs.
  • Expand the Tkinter Settings UI in gui.py with a collapsible "Tor / Onion Support" section: tooltips, an Enable .onion processing checkbox, a Use NordVPN Onion-over-VPN checkbox (enabled only when the NordVPN CLI and its Onion group are detected), a Random User-Agent checkbox, a combobox preloaded from the UA library, an Import User-Agent list from TXT option, an Add Custom User-Agent dialog, and per-setting radio groups (Clear Web only / Onion only / Both) persisted to config.
  • Enforce runtime behavior in gui.py and main.py: skip onion URLs when disabled, prompt/warn and optionally start Tor if .onion processing is enabled but Tor is not running, store per-site onion_status into processed_sites.json, and (when configured) temporarily switch NordVPN to its Onion group before extracting onion sites and restore it afterwards.
  • Add Tor helpers and NordVPN-onion-group detection to install_tools.py and expose an installer checkbox for Tor Python deps; improve privacy logging in logging.py to redact onion domains to genericonionexample.onion in the privacy log.
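The onion detection and reachability classification described above could look roughly like this. The names is_onion_url and classify_onion_reachability and the live/seized-down/tor_error labels come from the PR text, but the bodies are a minimal sketch, not the shipped implementation; the SOCKS transport additionally requires requests[socks] (PySocks).

```python
import re
from urllib.parse import urlparse

TOR_SOCKS_PROXY = "socks5h://127.0.0.1:9050"  # socks5h resolves DNS through Tor
# Simplified host pattern: v2 (16-char) and v3 (56-char) base32 labels.
ONION_RE = re.compile(r"^[a-z2-7]{16,56}\.onion$", re.IGNORECASE)


def is_onion_url(url: str) -> bool:
    """True when the URL's host is a .onion address."""
    host = urlparse(url if "://" in url else f"http://{url}").hostname or ""
    return bool(ONION_RE.match(host))


def classify_onion_reachability(url: str, timeout: int = 45) -> str:
    """Probe a .onion URL through the local Tor SOCKS proxy.

    Returns "live", "seized/down", or "tor_error" (Tor not running or the
    SOCKS port unreachable). Needs requests[socks] installed.
    """
    import requests  # third-party; imported lazily so detection works without it

    proxies = {"http": TOR_SOCKS_PROXY, "https": TOR_SOCKS_PROXY}
    try:
        resp = requests.get(url, proxies=proxies, timeout=timeout)
    except requests.exceptions.ConnectionError:
        return "tor_error"    # SOCKS connection refused: Tor is not running
    except requests.exceptions.RequestException:
        return "seized/down"  # Tor reachable, but the hidden service is not
    return "live" if resp.ok else "seized/down"
```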
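Likewise, the per-request UA resolution could be sketched as below. resolve_user_agent, USER_AGENTS, and user_agent_library are named in the PR; the user_agent_scope and random_user_agent config keys and the abbreviated UA list are assumptions for illustration.

```python
import random

# Hypothetical abbreviated library; the PR ships 30+ realistic 2025-2026 UAs.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]
DEFAULT_UA = USER_AGENTS[0]


def resolve_user_agent(config: dict, target_url: str = "") -> str:
    """Pick the UA for one request: random-per-request when enabled, else the
    selected/custom UA, honoring the UA scope for onion vs clear-web targets."""
    is_onion = ".onion" in target_url
    scope = config.get("user_agent_scope", "both")  # assumed key name
    in_scope = (scope == "both"
                or (scope == "onion_only" and is_onion)
                or (scope == "clear_web_only" and not is_onion))
    if not in_scope:
        return DEFAULT_UA  # out-of-scope targets keep a baseline UA
    if config.get("random_user_agent"):  # assumed key name
        return random.choice(config.get("user_agent_library") or USER_AGENTS)
    return config.get("user_agent") or DEFAULT_UA
```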

Testing

  • python -m py_compile gui.py config.py extract.py fetch.py helpers.py install_tools.py main.py tor_fetch.py logging.py — compilation check completed successfully.
  • Smoke test: load_config() + get_user_agent_library() + is_onion_url() + resolve_user_agent(...) + classify_onion_reachability('http://exampleexampleexample.onion') — executed and returned expected values including tor_error when Tor is not running.
  • Smoke test: extract_login_form('http://exampleexampleexample.onion', mode='playwright') — executed and returned the expected skipped/tor error payload when .onion processing is disabled or Tor is not running.
  • Import/runtime sanity: import gui, extract, fetch, tor_fetch, helpers, install_tools, main — successful import with new modules.
  • Tooling checks: detect_tor_installation() and check_nordvpn_onion_support() executed successfully in this environment (detection results reported).

Files changed (high level): gui.py, config.py, helpers.py, fetch.py, extract.py, install_tools.py, main.py, logging.py and new tor_fetch.py (see commit for full diffs).


Codex Task


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25cfbbffb5


```python
("Proxy routing", "proxy_scope", "Which targets may use Burp/ZAP/proxy_url or Tor routing."),
("Threads", "threads_scope", "Limit the extractor thread count to clear web, onion, or both."),
("Timeout", "timeout_scope", "Apply the configured extraction timeout to clear web, onion, or both."),
("Captcha solving", "captcha_scope", "Restrict CAPTCHA solving providers to the selected target type."),
```

P2: Honor captcha_scope before calling CAPTCHA providers

This adds a user-facing scope control, but a repo-wide rg -n "captcha_scope" only finds config/UI references; the extraction path still calls solve_captcha(soup, url) unconditionally in extract_login_form(). If an operator sets CAPTCHA solving to clear_web only to avoid leaking onion URLs or wasting solver credits, onion pages will still be sent to the configured provider, and vice versa.
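One way the fix could look: gate the provider call behind scope_applies (the helper name used elsewhere in this PR). The gating wrapper and its signature are hypothetical, and scope_applies is re-sketched here so the example is self-contained.

```python
def scope_applies(scope: str, url: str) -> bool:
    """Simplified scope check: does a per-setting scope cover this target?"""
    is_onion = ".onion" in url  # host-level check simplified for illustration
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True  # "both" and unknown values fail open


def maybe_solve_captcha(soup, url: str, config: dict, solver):
    """Gate the provider call so out-of-scope pages are never sent upstream
    (no leaked onion URLs, no wasted solver credits)."""
    if not scope_applies(config.get("captcha_scope", "both"), url):
        return None  # treated as "no captcha solved" by the caller
    return solver(soup, url)
```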



```diff
 error_info = {"status": "failed", "error_message": "unknown"}
-current_proxy = self.proxy_manager.get_proxy() if self.proxy_manager else proxy
+current_proxy = self.proxy_manager.get_proxy() if (self.proxy_manager and scope_applies(config.get("proxy_scope", "both"), base)) else proxy
```

P2: Bypass fixed proxies when proxy_scope does not match

This only scopes ProxyManager output. When rotation is off, current_proxy falls back to the already-resolved proxy from get_intercept_proxy(config, proxy_candidate), and the fetchers also re-read proxy settings from config, so selecting clear_web_only or onion_only never actually disables proxy_url/Burp/ZAP for the opposite target class. Users relying on the new scope to keep one class of traffic off their intercept proxy will still route it there.
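A sketch of applying the scope to both the rotation path and the already-resolved fixed proxy. resolve_request_proxy is a hypothetical helper (the real code wires this through get_intercept_proxy and the fetchers), and scope_applies is re-sketched for self-containment.

```python
def scope_applies(scope: str, url: str) -> bool:
    """Simplified scope check: does a per-setting scope cover this target?"""
    is_onion = ".onion" in url
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True  # "both" and unknown values fail open


def resolve_request_proxy(config: dict, url: str, proxy_manager=None, fixed_proxy=None):
    """Return the proxy for this request, or None when proxy_scope excludes it.

    Checking the scope here covers both rotation output and the fixed
    proxy_url/Burp/ZAP value resolved earlier, so the opposite target class
    genuinely stays off the intercept proxy."""
    if not scope_applies(config.get("proxy_scope", "both"), url):
        return None
    if proxy_manager is not None:
        return proxy_manager.get_proxy()  # rotation path
    return fixed_proxy                    # fixed intercept proxy path
```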



```diff
-timeout_seconds = max(300, int(config.get("extract_site_timeout_seconds", 300) or 300))  # FIXED: 5-minute per-site timeout
 executor = ThreadPoolExecutor(max_workers=self.threads.get())
+timeout_seconds = max(45, int(config.get("extract_site_timeout_seconds", 300) or 300))  # FIXED: per-site timeout with Tor floor
```

P2: Apply timeout_scope before enforcing per-site timeouts

The new timeout scope is not consulted here: a repo-wide rg -n "timeout_scope" only finds the onion fetch branch in extract.py, but this scheduler still turns extract_site_timeout_seconds into a single global timeout for every future. If a user sets timeout scope to onion_only, slow clear-web sites will still be cancelled and recorded as extract_timeout, contrary to the setting they just saved.
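A sketch of consulting timeout_scope before the scheduler enforces a per-future timeout. effective_timeout is hypothetical; returning None corresponds to calling Future.result() with timeout=None, i.e. no scheduler-imposed cancellation for out-of-scope targets.

```python
def scope_applies(scope: str, url: str) -> bool:
    """Simplified scope check: does a per-setting scope cover this target?"""
    is_onion = ".onion" in url
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True  # "both" and unknown values fail open


def effective_timeout(config: dict, url: str):
    """Per-future timeout honoring timeout_scope; None means wait indefinitely.

    Keeps the 45 s floor from the diff above, but only for targets the
    scope actually covers, so e.g. timeout_scope=onion_only no longer
    cancels slow clear-web sites as extract_timeout."""
    if not scope_applies(config.get("timeout_scope", "both"), url):
        return None
    return max(45, int(config.get("extract_site_timeout_seconds", 300) or 300))
```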

