
Add Tor (.onion) extraction support, UA library, UI controls and reachability checks#73

Open
mattysparkles wants to merge 1 commit into main from codex/add-tor-support-to-parserpro

Conversation

@mattysparkles (Owner)

Motivation

  • Enable first-class handling of .onion targets including reachability checks and extraction via Tor/SOCKS5 while preserving the existing clear-web flow.
  • Provide per-target user-agent control and per-setting scope toggles so extraction behavior (proxy, threads, timeout, captcha, validation, UA rotation) can be applied to Clear Web, Onion-only, or Both.
  • Improve privacy-aware logging for onion domains and offer runtime helpers to start Tor or leverage NordVPN Onion group when available.

Description

  • Add new persisted config keys in config.py (e.g. enable_onion_processing, use_nordvpn_onion_only, user_agent_library, and scope keys such as proxy_scope), with default values saved to and loaded from DATA_DIR/config.json.
  • Implement helpers in helpers.py: a USER_AGENTS library of 30+ realistic 2025–2026 user agents, resolve_user_agent with its scope logic, .onion detection/normalization, Tor SOCKS constants, Tor process start helpers, and classify_onion_reachability, which probes .onion URLs over socks5h://127.0.0.1:9050 with a 45 s timeout and returns live, seized/down, or tor_error.
  • Add a dedicated tor_fetch.py that performs Playwright-over-SOCKS5 fetches (preferred) with a fallback to requests via Tor, returning the used UA and error payloads.
  • Wire UA selection/resolution into fetch.py so Playwright, Selenium and requests use resolve_user_agent(config, target_url=...) (supports random-per-request vs selected/custom UA and scope enforcement).
  • Update extract.py to: enforce .onion enablement, run the onion reachability check prior to extraction, use the tor_fetch path for onion HTML retrieval, and include the UA used in the returned result and logs.
  • Expand the Tkinter Settings UI in gui.py with a collapsible "Tor / Onion Support" section: tooltips, an Enable .onion processing checkbox, a Use NordVPN Onion-over-VPN checkbox (enabled only when the NordVPN CLI and its Onion group are detected), a Random User-Agent checkbox, a combobox preloaded from the UA library, an Import User-Agent list from TXT option, an Add Custom User-Agent dialog, and per-setting radio groups (Clear Web only / Onion only / Both) persisted to config.
  • Enforce runtime behavior in gui.py and main.py: skip onion URLs when disabled, prompt/warn and optionally start Tor if .onion processing is enabled but Tor is not running, store per-site onion_status into processed_sites.json, and (when configured) temporarily switch NordVPN to its Onion group before extracting onion sites and restore it afterwards.
  • Add Tor helpers and NordVPN-onion-group detection to install_tools.py and expose an installer checkbox for Tor Python deps; improve privacy logging in logging.py to redact onion domains to genericonionexample.onion in the privacy log.
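The onion detection and reachability classification described above could look roughly like this. The names is_onion_url and classify_onion_reachability and the live/seized-down/tor_error labels come from the PR text, but the bodies are a minimal sketch, not the shipped implementation; the SOCKS transport additionally requires requests[socks] (PySocks).

```python
import re
from urllib.parse import urlparse

TOR_SOCKS_PROXY = "socks5h://127.0.0.1:9050"  # socks5h resolves DNS through Tor
# Simplified host pattern: v2 (16-char) and v3 (56-char) base32 labels.
ONION_RE = re.compile(r"^[a-z2-7]{16,56}\.onion$", re.IGNORECASE)


def is_onion_url(url: str) -> bool:
    """True when the URL's host is a .onion address."""
    host = urlparse(url if "://" in url else f"http://{url}").hostname or ""
    return bool(ONION_RE.match(host))


def classify_onion_reachability(url: str, timeout: int = 45) -> str:
    """Probe a .onion URL through the local Tor SOCKS proxy.

    Returns "live", "seized/down", or "tor_error" (Tor not running or the
    SOCKS port unreachable). Needs requests[socks] installed.
    """
    import requests  # third-party; imported lazily so detection works without it

    proxies = {"http": TOR_SOCKS_PROXY, "https": TOR_SOCKS_PROXY}
    try:
        resp = requests.get(url, proxies=proxies, timeout=timeout)
    except requests.exceptions.ConnectionError:
        return "tor_error"    # SOCKS connection refused: Tor is not running
    except requests.exceptions.RequestException:
        return "seized/down"  # Tor reachable, but the hidden service is not
    return "live" if resp.ok else "seized/down"
```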
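Likewise, the per-request UA resolution could be sketched as below. resolve_user_agent, USER_AGENTS, and user_agent_library are named in the PR; the user_agent_scope and random_user_agent config keys and the abbreviated UA list are assumptions for illustration.

```python
import random

# Hypothetical abbreviated library; the PR ships 30+ realistic 2025-2026 UAs.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]
DEFAULT_UA = USER_AGENTS[0]


def resolve_user_agent(config: dict, target_url: str = "") -> str:
    """Pick the UA for one request: random-per-request when enabled, else the
    selected/custom UA, honoring the UA scope for onion vs clear-web targets."""
    is_onion = ".onion" in target_url
    scope = config.get("user_agent_scope", "both")  # assumed key name
    in_scope = (scope == "both"
                or (scope == "onion_only" and is_onion)
                or (scope == "clear_web_only" and not is_onion))
    if not in_scope:
        return DEFAULT_UA  # out-of-scope targets keep a baseline UA
    if config.get("random_user_agent"):  # assumed key name
        return random.choice(config.get("user_agent_library") or USER_AGENTS)
    return config.get("user_agent") or DEFAULT_UA
```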

Testing

  • python -m py_compile gui.py config.py extract.py fetch.py helpers.py install_tools.py main.py tor_fetch.py logging.py — compilation check completed successfully.
  • Smoke test: load_config() + get_user_agent_library() + is_onion_url() + resolve_user_agent(...) + classify_onion_reachability('http://exampleexampleexample.onion') — executed and returned expected values including tor_error when Tor is not running.
  • Smoke test: extract_login_form('http://exampleexampleexample.onion', mode='playwright') — executed and returned the expected skipped/tor error payload when .onion processing is disabled or Tor is not running.
  • Import/runtime sanity: import gui, extract, fetch, tor_fetch, helpers, install_tools, main — successful import with new modules.
  • Tooling checks: detect_tor_installation() and check_nordvpn_onion_support() executed successfully in this environment (detection results reported).

Files changed (high level): gui.py, config.py, helpers.py, fetch.py, extract.py, install_tools.py, main.py, logging.py and new tor_fetch.py (see commit for full diffs).


Codex Task


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25cfbbffb5


```python
("Proxy routing", "proxy_scope", "Which targets may use Burp/ZAP/proxy_url or Tor routing."),
("Threads", "threads_scope", "Limit the extractor thread count to clear web, onion, or both."),
("Timeout", "timeout_scope", "Apply the configured extraction timeout to clear web, onion, or both."),
("Captcha solving", "captcha_scope", "Restrict CAPTCHA solving providers to the selected target type."),
```

P2: Honor captcha_scope before calling CAPTCHA providers

This adds a user-facing scope control, but a repo-wide rg -n "captcha_scope" only finds config/UI references; the extraction path still calls solve_captcha(soup, url) unconditionally in extract_login_form(). If an operator sets CAPTCHA solving to clear_web only to avoid leaking onion URLs or wasting solver credits, onion pages will still be sent to the configured provider, and vice versa.
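One way the fix could look: gate the provider call behind scope_applies (the helper name used elsewhere in this PR). The gating wrapper and its signature are hypothetical, and scope_applies is re-sketched here so the example is self-contained.

```python
def scope_applies(scope: str, url: str) -> bool:
    """Simplified scope check: does a per-setting scope cover this target?"""
    is_onion = ".onion" in url  # host-level check simplified for illustration
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True  # "both" and unknown values fail open


def maybe_solve_captcha(soup, url: str, config: dict, solver):
    """Gate the provider call so out-of-scope pages are never sent upstream
    (no leaked onion URLs, no wasted solver credits)."""
    if not scope_applies(config.get("captcha_scope", "both"), url):
        return None  # treated as "no captcha solved" by the caller
    return solver(soup, url)
```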



```diff
 error_info = {"status": "failed", "error_message": "unknown"}
-current_proxy = self.proxy_manager.get_proxy() if self.proxy_manager else proxy
+current_proxy = self.proxy_manager.get_proxy() if (self.proxy_manager and scope_applies(config.get("proxy_scope", "both"), base)) else proxy
```

P2: Bypass fixed proxies when proxy_scope does not match

This only scopes ProxyManager output. When rotation is off, current_proxy falls back to the already-resolved proxy from get_intercept_proxy(config, proxy_candidate), and the fetchers also re-read proxy settings from config, so selecting clear_web_only or onion_only never actually disables proxy_url/Burp/ZAP for the opposite target class. Users relying on the new scope to keep one class of traffic off their intercept proxy will still route it there.
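A sketch of applying the scope to both the rotation path and the already-resolved fixed proxy. resolve_request_proxy is a hypothetical helper (the real code wires this through get_intercept_proxy and the fetchers), and scope_applies is re-sketched for self-containment.

```python
def scope_applies(scope: str, url: str) -> bool:
    """Simplified scope check: does a per-setting scope cover this target?"""
    is_onion = ".onion" in url
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True  # "both" and unknown values fail open


def resolve_request_proxy(config: dict, url: str, proxy_manager=None, fixed_proxy=None):
    """Return the proxy for this request, or None when proxy_scope excludes it.

    Checking the scope here covers both rotation output and the fixed
    proxy_url/Burp/ZAP value resolved earlier, so the opposite target class
    genuinely stays off the intercept proxy."""
    if not scope_applies(config.get("proxy_scope", "both"), url):
        return None
    if proxy_manager is not None:
        return proxy_manager.get_proxy()  # rotation path
    return fixed_proxy                    # fixed intercept proxy path
```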



```diff
-timeout_seconds = max(300, int(config.get("extract_site_timeout_seconds", 300) or 300))  # FIXED: 5-minute per-site timeout
 executor = ThreadPoolExecutor(max_workers=self.threads.get())
+timeout_seconds = max(45, int(config.get("extract_site_timeout_seconds", 300) or 300))  # FIXED: per-site timeout with Tor floor
```

P2: Apply timeout_scope before enforcing per-site timeouts

The new timeout scope is not consulted here: a repo-wide rg -n "timeout_scope" only finds the onion fetch branch in extract.py, but this scheduler still turns extract_site_timeout_seconds into a single global timeout for every future. If a user sets timeout scope to onion_only, slow clear-web sites will still be cancelled and recorded as extract_timeout, contrary to the setting they just saved.
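A sketch of consulting timeout_scope before the scheduler enforces a per-future timeout. effective_timeout is hypothetical; returning None corresponds to calling Future.result() with timeout=None, i.e. no scheduler-imposed cancellation for out-of-scope targets.

```python
def scope_applies(scope: str, url: str) -> bool:
    """Simplified scope check: does a per-setting scope cover this target?"""
    is_onion = ".onion" in url
    if scope == "onion_only":
        return is_onion
    if scope == "clear_web_only":
        return not is_onion
    return True  # "both" and unknown values fail open


def effective_timeout(config: dict, url: str):
    """Per-future timeout honoring timeout_scope; None means wait indefinitely.

    Keeps the 45 s floor from the diff above, but only for targets the
    scope actually covers, so e.g. timeout_scope=onion_only no longer
    cancels slow clear-web sites as extract_timeout."""
    if not scope_applies(config.get("timeout_scope", "both"), url):
        return None
    return max(45, int(config.get("extract_site_timeout_seconds", 300) or 300))
```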

