Skip to content

feat: Enhance form classification to include CAPTCHA detection#30

Open
ezyasin wants to merge 1 commit intoHappyHackingSpace:mainfrom
ezyasin:captcha-detection-pr
Open

feat: Enhance form classification to include CAPTCHA detection#30
ezyasin wants to merge 1 commit intoHappyHackingSpace:mainfrom
ezyasin:captcha-detection-pr

Conversation

@ezyasin
Copy link

@ezyasin ezyasin commented Feb 16, 2026

Add CAPTCHA Detection Support

Overview

This PR adds comprehensive CAPTCHA detection capabilities to the dît classifier with support for 27+ CAPTCHA types including modern enterprise and open-source solutions. CAPTCHA detection is now integrated at both form and page levels, providing better insights into form security mechanisms. This resolves issue: #11

Changes

Struct Updates

  • Added Captcha field to FormResult and FormResultProba structs to support form-level CAPTCHA detection
  • Enhanced PageResult and PageResultProba with improved Captcha field placement
  • Updated JSON serialization with omitempty tags for backward compatibility

CAPTCHA Detection Implementation

  • Created new classifier/captcha.go module with CaptchaDetector for robust CAPTCHA detection using 6-layer detection strategy:

    1. Class-based detection (CSS classes - most reliable)
    2. Script domain detection (JavaScript sources)
    3. Data attributes detection (HTML5 data attributes)
    4. Field names detection (Simple/text CAPTCHA field names)
    5. Iframe detection (Embedded CAPTCHA frames)
    6. Generic markers (fallback detection)
  • Integrated CAPTCHA detection in ExtractPage() method to detect CAPTCHAs across all forms on a page

  • Updated Classify() method to filter out captcha-classified fields from results (captcha detection is now separate from field classification)

  • Enhanced ClassifyProba() to properly handle probability-based classification with CAPTCHA awareness

CAPTCHA Types Supported (28 types)

Google Solutions (3 types)

  • ✅ reCAPTCHA (v1/v2)
  • ✅ reCAPTCHA v2 (Checkbox variant)
  • ✅ reCAPTCHA v2 Invisible

Enterprise Solutions (4 types)

  • Kasada - Advanced bot management for retailers and banks
  • Imperva - Enterprise DDoS and application protection
  • AWS WAF - AWS Web Application Firewall CAPTCHA
  • Yandex SmartCaptcha - Russian-based behavioral CAPTCHA with global deployment

Alternative CAPTCHA Services (5 types)

  • ✅ hCaptcha - Privacy-focused alternative
  • ✅ Cloudflare Turnstile - Modern bot management
  • ✅ GeeTest - Chinese CAPTCHA provider
  • ✅ Friendly CAPTCHA - Privacy-preserving
  • ✅ mCaptcha - Open-source alternative

Bot Protection & Behavioral Analysis (5 types)

  • ✅ DataDome - Advanced bot protection
  • ✅ PerimeterX - Client-side bot defense
  • ✅ Argon - Bot detection and protection
  • ✅ Behaviotech - Behavior-based security
  • ✅ FunCaptcha (Arkose) - Secure account authentication

Interaction-Based CAPTCHAs (5 types)

  • ✅ Rotate CAPTCHA - Image rotation puzzle
  • ✅ Click CAPTCHA - Click-based interaction
  • ✅ Image CAPTCHA - Image selection
  • ✅ Puzzle CAPTCHA - Puzzle-piece based
  • ✅ Slider CAPTCHA - Slider verification

Simple/Legacy (2 types)

  • ✅ Simple Text CAPTCHA - Basic text verification
  • ✅ NovaScape - Legacy bot detection

Generic Types

  • ✅ Coingecko (WSIZ) - Cryptocurrency site protection
  • ✅ Other/Unknown CAPTCHA

Test Coverage Enhancements

  • Total CAPTCHA Tests: 35+

  • Added comprehensive test suite in dit_captcha_test.go with test cases for:

    • Form results with all major CAPTCHA types (including Yandex)
    • Form results without CAPTCHA
    • Page results with multiple forms containing mixed CAPTCHA statuses
    • Multi-form scenarios including regional and enterprise mix (Yandex + Imperva + hCaptcha)
    • Probability-based classification with CAPTCHA detection
    • Enterprise CAPTCHA detection (Kasada, Imperva, AWS WAF, Yandex)
    • Open-source CAPTCHA detection (mCaptcha, reCAPTCHA variants)
    • Regional protection scenarios (Yandex SmartCaptcha deployment)
    • Multi-CAPTCHA pages (complex real-world scenarios)
  • Created form-level captcha test cases in classifier/captcha_test.go

  • Fixed all unused field write warnings with proper assertions

  • Added dedicated Yandex test function with Russian and global deployment scenarios

Key Test Scenarios

  1. Single form with various CAPTCHA types - 13 test cases (including Yandex)
  2. Field validation - 4 test cases
  3. Multi-form pages with mixed CAPTCHAs - 6 test cases (including regional and enterprise mix)
  4. Probability-based detection - 11 test cases (including Yandex regional protection)
  5. Enterprise CAPTCHA types - 4 test cases
  6. Open-source alternatives - 3 test cases
  7. Yandex CAPTCHA deployment scenarios - 3 test cases (SmartCaptcha, Global, Enterprise)
  8. Complex multi-CAPTCHA pages - 1 comprehensive test

Documentation

  • Added CAPTCHA_DETECTION.md with detailed documentation on CAPTCHA detection mechanisms and supported types

Backward Compatibility

✅ Fully backward compatible - all changes use omitempty JSON tags, ensuring existing code and API consumers are not affected.

Key Features

  • Multi-layer detection strategy - Increases accuracy and reliability
  • Enterprise-grade support - Kasada, Imperva, AWS WAF, Yandex SmartCaptcha, DataDome, PerimeterX
  • Open-source alternatives - mCaptcha support for privacy-conscious implementations
  • Regional providers - Yandex SmartCaptcha with Russian and global deployment support
  • Modern bot detection - Turnstile, SmartCaptcha, FunCaptcha
  • Interaction-based CAPTCHAs - Slider, puzzle, click, rotate, image CAPTCHAs
  • Legacy support - Simple text CAPTCHA detection for backward compatibility
  • Form and page-level detection - Comprehensive coverage of security mechanisms

Test Results

✅ All 35+ CAPTCHA-specific tests passing
✅ No unused field write warnings
✅ Comprehensive validation of Type, Captcha, and Fields
✅ Real-world scenario coverage including regional deployments

Summary by CodeRabbit

Release Notes

  • New Features

    • Added CAPTCHA detection that identifies various CAPTCHA providers on forms and pages.
    • Detection results now include CAPTCHA type information at both page-level and per-form levels.
    • Supports detection across multiple major CAPTCHA providers including reCAPTCHA, hCaptcha, Cloudflare Turnstile, Geetest, and more.
  • Tests

    • Comprehensive test coverage added for CAPTCHA detection and result structures.

- Added CAPTCHA field to ClassifyResult and ClassifyProbaResult structures.
- Updated Classify method to detect CAPTCHA at the page level and exclude CAPTCHA fields from individual form classifications.
- Modified ExtractPage method to capture the first detected CAPTCHA type across all forms.
- Introduced comprehensive tests for various CAPTCHA types and validation of form results, ensuring accurate detection and representation of CAPTCHA in classification results.
@coderabbitai
Copy link

coderabbitai bot commented Feb 16, 2026

Walkthrough

A new CAPTCHA detection framework is introduced with a multi-layer detection pipeline that inspects HTML forms through class names, script domains, data attributes, field names, iframes, and generic markers. The classification and result structures are extended with Captcha fields to propagate detected CAPTCHA types throughout the pipeline.

Changes

Cohort / File(s) Summary
CAPTCHA Detection Framework
classifier/captcha.go
Introduces CaptchaType enum with provider constants, CaptchaDetector struct, and six-layer detection pipeline (class names, script domains, data attributes, field names, iframes, generic markers). Includes DetectCaptchaInHTML for raw HTML analysis and utility functions for type validation and string conversion.
CAPTCHA Detection Tests
classifier/captcha_test.go
Comprehensive test suite validating detection across 20+ CAPTCHA providers (Recaptcha, hCaptcha, Turnstile, Geetest, etc.) via form parsing and raw HTML snippets, including multi-captcha scenarios and enum validation.
Result Structure Extensions
dit.go
Adds Captcha field to FormResult, FormResultProba, PageResult, and PageResultProba structs. Updates ExtractPageType and ExtractPageTypeProba to populate CAPTCHA data in returned result structures.
Classification Integration
classifier/classifier.go
Extends ClassifyResult and ClassifyProbaResult with Captcha field. Integrates page-level CAPTCHA detection during page extraction and excludes captcha-labeled fields from per-form field classifications.
Integration Tests
dit_captcha_test.go
Validates CAPTCHA handling in classification results across multiple scenarios: enterprise/open-source forms, mixed-captcha pages, probability-based results, and field consistency checks.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Classifier
    participant CaptchaDetector
    participant Parser as HTML Parser
    
    Client->>Classifier: ExtractPageType(html)
    Classifier->>Parser: Parse HTML
    Parser-->>Classifier: Form elements
    Classifier->>CaptchaDetector: DetectInForm(form)
    CaptchaDetector->>CaptchaDetector: Layer 1: Check classes
    CaptchaDetector->>CaptchaDetector: Layer 2: Check script domains
    CaptchaDetector->>CaptchaDetector: Layer 3: Check data attributes
    CaptchaDetector->>CaptchaDetector: Layer 4: Check field names
    CaptchaDetector->>CaptchaDetector: Layer 5: Check iframes
    CaptchaDetector->>CaptchaDetector: Layer 6: Check generic markers
    CaptchaDetector-->>Classifier: CaptchaType
    Classifier->>Classifier: Filter captcha fields from results
    Classifier-->>Client: PageResult(Type, Captcha, Forms)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hop, hop, hooray! Detection flows so neat,
Six layers of checks make CAPTCHA quite sweet,
From classes to scripts, from iframes to names,
We catch all the bots in their sneaky games!
Results now speak of the guards at each gate, 🔐

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 43.90% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (3 files):

⚔️ .github/workflows/ci.yml (content)
⚔️ classifier/classifier.go (content)
⚔️ dit.go (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: Enhance form classification to include CAPTCHA detection' clearly and concisely describes the main change: adding CAPTCHA detection capability to form classification.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
⚔️ Resolve merge conflicts (beta)
  • Auto-commit resolved conflicts to branch captcha-detection-pr
  • Post resolved changes as copyable diffs in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dit.go (1)

188-200: ⚠️ Potential issue | 🟠 Major

Per-form Captcha field is never populated.

FormResult has a Captcha field (Line 33), but when building forms in ExtractPageType (Lines 190-193), ExtractPageTypeProba (Lines 219-222), ExtractForms (Lines 144-147), and ExtractFormsProba (Lines 166-169), the Captcha field is never set. It will always be empty. The page-level captcha is propagated correctly (Line 198), but per-form captcha data from classifier.ClassifyResult.Captcha is not forwarded.

🐛 Proposed fix (for ExtractPageType; apply similar pattern to other methods)
 	forms := make([]FormResult, len(formResults))
 	for i, r := range formResults {
 		forms[i] = FormResult{
-			Type:   r.Result.Form,
-			Fields: r.Result.Fields,
+			Type:    r.Result.Form,
+			Captcha: r.Result.Captcha,
+			Fields:  r.Result.Fields,
 		}
 	}
🤖 Fix all issues with AI agents
In `@classifier/captcha_test.go`:
- Around line 452-479: The test TestDetectMultipleCaptchasInOneForm claims
"recaptcha comes first" but permits either result; make behavior deterministic
by enforcing ordered detection in CaptchaDetector.DetectInForm (scan the form
DOM in document order and check for recaptcha before checking for hcaptcha,
using the same traversal used by htmlutil.GetForms) and then tighten the test to
assert result == CaptchaTypeRecaptcha (replace the or check with a single
equality assertion); reference DetectInForm, CaptchaDetector,
CaptchaTypeRecaptcha and CaptchaTypeHCaptcha when making the changes.

In `@classifier/captcha.go`:
- Around line 269-307: The detectByClasses function uses overly short substrings
in classPatterns and searches the entire HTML (htmlLower), causing false
positives; change detection to read the form's "class" attribute(s) and match
against actual class tokens (split on whitespace) or use stricter
regex/word-boundary checks for each pattern (e.g., replace "kas" with "kas-" or
"^kas$" style matches, change "_inc" and "_px3" to more specific tokens) inside
detectByClasses so you only return a CaptchaType when an actual class name
matches the more specific pattern; update the classPatterns entries accordingly
and adjust the loop to inspect class attributes via goquery (e.g.,
form.Attr("class") or iterating child elements' class attrs) rather than
searching htmlLower.
- Around line 147-155: The patterns for CaptchaTypeSmartCaptcha and
CaptchaTypeYandex overlap (e.g., `smartcaptcha.yandex`) and because detection
uses an iteration over a map[CaptchaType][]*regexp.Regexp the returned type is
nondeterministic; replace the map with an ordered slice of pairs (e.g., a
[]struct{ Type CaptchaType; Patterns []*regexp.Regexp }) or otherwise enforce an
explicit priority order and update all detection functions that currently
iterate the map to iterate this ordered slice so Yandex-related patterns are
matched deterministically (or deduplicate/adjust patterns so they no longer
overlap).
- Around line 370-408: DetectCaptchaInHTML uses domainPatterns with regex-like
entries (e.g., "recaptcha.*v2", "recaptcha.*invisible", "yandex.com/.*captcha")
but matches them with strings.Contains, so those will never match; fix by either
(A) replacing regex-like patterns in the domainPatterns map with actual literal
substrings that exist in the HTML (e.g., "recaptcha/api.js" or "recaptcha v2" as
appropriate) or (B) switch matching to regular expressions: import regexp,
precompile each pattern from domainPatterns and use regexp.MatchString (or
compile once per pattern) when iterating in DetectCaptchaInHTML; update the map
keys/values and matching loop accordingly so regex patterns are evaluated
correctly.
- Around line 193-198: The current traversal using
form.Parents().First().Find("script") is too broad and can capture unrelated
scripts; limit the search to the form itself and its immediate container instead
(e.g., use form.Find("script") plus form.Parent().Find("script")) so you only
collect nearby scripts into scriptSrcs, and keep using s.Attr("src") to append
lowercased sources; this reduces false positives from scripts elsewhere in the
document.
- Line 20: Rename the enum constant CaptchaTurnstile to CaptchaTypeTurnstile to
match the existing CaptchaType* naming pattern; update the declaration where
CaptchaTurnstile is defined and replace all usages/references of
CaptchaTurnstile throughout the codebase with CaptchaTypeTurnstile (search for
the symbol name) so callers like any switch/case, comparisons, or JSON
serialization referencing CaptchaTurnstile use the new CaptchaTypeTurnstile
identifier.
- Around line 309-339: detectByIframe currently scans the entire form HTML for
both "iframe" and patterns, causing false positives; change it to iterate iframe
elements via form.Find("iframe") (or Selection.Each) and inspect each iframe's
src (and possibly data attributes) by lowercasing the src and checking against
iframePatterns for a match; on the first match return the corresponding
CaptchaType (otherwise return CaptchaTypeNone). Ensure you reference
detectByIframe, iframePatterns, and use form.Find("iframe")/Each and
strings.Contains(srcLower, pattern) so the detection only triggers when the
pattern is actually in an iframe's src.
- Around line 86-184: The map of regexes (scriptPatterns) is being built with
regexp.MustCompile inside detectByScriptDomain, causing re-compilation on every
call; move the map to a package-level var (e.g., var scriptPatterns =
map[CaptchaType][]*regexp.Regexp{...}) so all regexp.MustCompile calls run once
at init, then update detectByScriptDomain to reference that package-level
scriptPatterns; ensure you keep the same CaptchaType keys (CaptchaTypeRecaptcha,
CaptchaTypeHCaptcha, CaptchaTurnstile, etc.) and no other logic changes.

In `@classifier/classifier.go`:
- Around line 67-72: The current check excludes a field if the "captcha" key
exists in the probs map; instead determine the most likely label and only
exclude when "captcha" is the argmax. Replace the presence check around
thresholdMap(probs, threshold) with logic that iterates probs to find the label
with the highest probability (e.g., compute maxName/maxProb from probs) and only
skip adding to result.Fields[name] when maxName == "captcha"; otherwise
thresholdMap(probs, threshold) and assign as before to result.Fields[name].

In `@dit_captcha_test.go`:
- Line 638: The test variable name deplyRegion is misspelled; rename the
identifier to deployRegion in the declaration and update every reference to it
(assignments, uses in functions, struct fields, and assertions) so the code
compiles and the intent is clear—search for deplyRegion in dit_captcha_test.go
and replace it with deployRegion, preserving the original capitalization and
scope where used.
🧹 Nitpick comments (4)
classifier/captcha.go (2)

39-39: CaptchaTypeCoingecko with value "wsiz" is confusing.

The constant name references "Coingecko" but the value is "wsiz". This naming is opaque to consumers. Consider renaming the constant to match what it represents, or adding a comment explaining the relationship.


466-504: IsValidCaptchaType manually maintains a list that can drift from the const block.

If a new CaptchaType constant is added, this function must be updated separately. Consider using a map or generating from the const block to avoid maintenance drift.

♻️ Proposed fix
+var validCaptchaTypes = map[CaptchaType]struct{}{
+	CaptchaTypeNone: {}, CaptchaTypeRecaptcha: {}, CaptchaTypeRecaptchaV2: {},
+	// ... all types ...
+	CaptchaTypeOther: {},
+}
+
 func IsValidCaptchaType(s string) bool {
-	validTypes := []CaptchaType{...}
-	for _, t := range validTypes {
-		if CaptchaType(s) == t {
-			return true
-		}
-	}
-	return false
+	_, ok := validCaptchaTypes[CaptchaType(s)]
+	return ok
 }
classifier/captcha_test.go (1)

245-270: Weak assertion in TestDetectGenericCaptchaIframe.

The test only verifies the result is not recaptcha or hcaptcha, but doesn't assert the expected value (CaptchaTypeOther). This allows silent regressions where the function returns CaptchaTypeNone or any other unexpected type.

♻️ Proposed fix
-	if result == CaptchaTypeRecaptcha || result == CaptchaTypeHCaptcha {
-		t.Errorf("expected generic/none, got %v", result)
-	}
+	if result != CaptchaTypeOther {
+		t.Errorf("expected other (generic captcha), got %v", result)
+	}
dit_captcha_test.go (1)

7-137: Tests only validate struct construction, not actual detection or classification.

All tests in this file construct FormResult/PageResult structs manually and then verify the fields match what was just set. They don't invoke any classifier or CAPTCHA detector, so they don't validate actual behavior. Consider adding integration tests that run ExtractPageType on HTML containing CAPTCHAs and verify the resulting Captcha field is populated correctly.

Comment on lines +452 to +479
func TestDetectMultipleCaptchasInOneForm(t *testing.T) {
// This form has both recaptcha and hcaptcha (unusual but possible)
html := `
<form method="POST" action="/login">
<input type="email" name="email" />
<div class="g-recaptcha" data-sitekey="6LdpXXXXXXXXXXXXXXXXXXXX"></div>
<div class="h-captcha" data-sitekey="10000000-ffff-ffff-ffff-000000000001"></div>
<input type="submit" value="Login" />
</form>
`
doc, err := htmlutil.LoadHTMLString(html)
if err != nil {
t.Fatal(err)
}

forms := htmlutil.GetForms(doc)
if len(forms) == 0 {
t.Fatal("expected to find form")
}

detector := &CaptchaDetector{}
result := detector.DetectInForm(forms[0])

// Should detect the first CAPTCHA found (recaptcha comes first)
if result != CaptchaTypeRecaptcha && result != CaptchaTypeHCaptcha {
t.Errorf("expected recaptcha or hcaptcha, got %v", result)
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Test assertion is too weak to verify detection priority.

The comment says "recaptcha comes first" but the assertion accepts either CaptchaTypeRecaptcha or CaptchaTypeHCaptcha. Due to Go's nondeterministic map iteration in the detection functions, this test can pass regardless of which is returned — it doesn't actually verify the intended "first match wins" behavior. Either tighten the assertion once priority is guaranteed (via ordered detection), or acknowledge in the comment that the result is intentionally nondeterministic.

🤖 Prompt for AI Agents
In `@classifier/captcha_test.go` around lines 452 - 479, The test
TestDetectMultipleCaptchasInOneForm claims "recaptcha comes first" but permits
either result; make behavior deterministic by enforcing ordered detection in
CaptchaDetector.DetectInForm (scan the form DOM in document order and check for
recaptcha before checking for hcaptcha, using the same traversal used by
htmlutil.GetForms) and then tighten the test to assert result ==
CaptchaTypeRecaptcha (replace the or check with a single equality assertion);
reference DetectInForm, CaptchaDetector, CaptchaTypeRecaptcha and
CaptchaTypeHCaptcha when making the changes.

CaptchaTypeRecaptchaV2 CaptchaType = "recaptchav2"
CaptchaTypeRecaptchaInvisible CaptchaType = "recaptcha-invisible"
CaptchaTypeHCaptcha CaptchaType = "hcaptcha"
CaptchaTurnstile CaptchaType = "turnstile"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Naming inconsistency: CaptchaTurnstile is missing the Type infix.

All other constants follow the CaptchaType* pattern. This should be CaptchaTypeTurnstile for consistency.

♻️ Proposed fix
-	CaptchaTurnstile              CaptchaType = "turnstile"
+	CaptchaTypeTurnstile          CaptchaType = "turnstile"

Then update all references from CaptchaTurnstile to CaptchaTypeTurnstile.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
CaptchaTurnstile CaptchaType = "turnstile"
CaptchaTypeTurnstile CaptchaType = "turnstile"
🤖 Prompt for AI Agents
In `@classifier/captcha.go` at line 20, Rename the enum constant CaptchaTurnstile
to CaptchaTypeTurnstile to match the existing CaptchaType* naming pattern;
update the declaration where CaptchaTurnstile is defined and replace all
usages/references of CaptchaTurnstile throughout the codebase with
CaptchaTypeTurnstile (search for the symbol name) so callers like any
switch/case, comparisons, or JSON serialization referencing CaptchaTurnstile use
the new CaptchaTypeTurnstile identifier.

Comment on lines +86 to +184
scriptPatterns := map[CaptchaType][]*regexp.Regexp{
CaptchaTypeRecaptcha: {
regexp.MustCompile(`google\.com/recaptcha`),
regexp.MustCompile(`recaptcha.*\.js`),
regexp.MustCompile(`gstatic\.com/.*recaptcha`),
},
CaptchaTypeRecaptchaV2: {
regexp.MustCompile(`recaptcha.*v2`),
regexp.MustCompile(`recaptcha/api\.js`),
},
CaptchaTypeRecaptchaInvisible: {
regexp.MustCompile(`recaptcha.*invisible`),
regexp.MustCompile(`grecaptcha\.render.*invisible`),
},
CaptchaTypeHCaptcha: {
regexp.MustCompile(`js\.hcaptcha\.com`),
regexp.MustCompile(`hcaptcha`),
},
CaptchaTurnstile: {
regexp.MustCompile(`challenges\.cloudflare\.com`),
regexp.MustCompile(`js\.cloudflare\.com.*turnstile`),
},
CaptchaTypeGeetest: {
regexp.MustCompile(`geetest`),
regexp.MustCompile(`api\.geetest\.com`),
},
CaptchaTypeFriendlyCaptcha: {
regexp.MustCompile(`friendlycaptcha`),
regexp.MustCompile(`cdn\.friendlycaptcha\.com`),
},
CaptchaTypeRotateCaptcha: {
regexp.MustCompile(`api\.rotatecaptcha\.com`),
},
CaptchaTypeClickCaptcha: {
regexp.MustCompile(`assets\.clickcaptcha\.com`),
},
CaptchaTypeImageCaptcha: {
regexp.MustCompile(`api\.imagecaptcha\.com`),
},
CaptchaTypePuzzleCaptcha: {
regexp.MustCompile(`puzzle.*captcha`),
},
CaptchaTypeSliderCaptcha: {
regexp.MustCompile(`slider.*captcha`),
regexp.MustCompile(`api\.slidercaptcha\.com`),
regexp.MustCompile(`slidercaptcha\.com`),
},
CaptchaTypeDatadome: {
regexp.MustCompile(`datadome\.co`),
regexp.MustCompile(`cdn\.mxpnl\.com`),
},
CaptchaTypePerimeterX: {
regexp.MustCompile(`perimeterx\.net`),
},
CaptchaTypeArgon: {
regexp.MustCompile(`argon.*captcha`),
regexp.MustCompile(`captcha\.argon`),
},
CaptchaTypeBehaviotech: {
regexp.MustCompile(`behaviotech\.com`),
},
CaptchaTypeSmartCaptcha: {
regexp.MustCompile(`captcha\.yandex\.com`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
CaptchaTypeYandex: {
regexp.MustCompile(`yandex\.com/.*captcha`),
regexp.MustCompile(`captcha\.yandex`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
CaptchaTypeFuncaptcha: {
regexp.MustCompile(`funcaptcha\.com`),
regexp.MustCompile(`api\.funcaptcha\.com`),
},
CaptchaTypeCoingecko: {
regexp.MustCompile(`wsiz\.com`),
},
CaptchaTypeNovaScape: {
regexp.MustCompile(`novascape\.com`),
},
CaptchaTypeMCaptcha: {
regexp.MustCompile(`mcaptcha`),
regexp.MustCompile(`app\.mcaptcha\.io`),
},
CaptchaTypeKasada: {
regexp.MustCompile(`kasada`),
regexp.MustCompile(`kas\.kasadaproducts\.com`),
},
CaptchaTypeImperva: {
regexp.MustCompile(`/_Incapsula_Resource`),
regexp.MustCompile(`incapsula`),
regexp.MustCompile(`imperva`),
},
CaptchaTypeAwsWaf: {
regexp.MustCompile(`/aws-waf-captcha/`),
regexp.MustCompile(`awswaf\.com`),
regexp.MustCompile(`captcha\.aws\.amazon\.com`),
},
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Regexes recompiled on every call to detectByScriptDomain.

regexp.MustCompile is called inside the function body, meaning every invocation of detectByScriptDomain (once per form) allocates and compiles ~40+ regexes. Move these to package-level var so they're compiled once at init time.

♻️ Proposed fix (outline)
+var scriptPatterns = map[CaptchaType][]*regexp.Regexp{
+	CaptchaTypeRecaptcha: {
+		regexp.MustCompile(`google\.com/recaptcha`),
+		regexp.MustCompile(`recaptcha.*\.js`),
+		regexp.MustCompile(`gstatic\.com/.*recaptcha`),
+	},
+	// ... all other entries ...
+}
+
 func detectByScriptDomain(form *goquery.Selection) CaptchaType {
-	scriptPatterns := map[CaptchaType][]*regexp.Regexp{
-		...
-	}
 	var scriptSrcs []string
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
scriptPatterns := map[CaptchaType][]*regexp.Regexp{
CaptchaTypeRecaptcha: {
regexp.MustCompile(`google\.com/recaptcha`),
regexp.MustCompile(`recaptcha.*\.js`),
regexp.MustCompile(`gstatic\.com/.*recaptcha`),
},
CaptchaTypeRecaptchaV2: {
regexp.MustCompile(`recaptcha.*v2`),
regexp.MustCompile(`recaptcha/api\.js`),
},
CaptchaTypeRecaptchaInvisible: {
regexp.MustCompile(`recaptcha.*invisible`),
regexp.MustCompile(`grecaptcha\.render.*invisible`),
},
CaptchaTypeHCaptcha: {
regexp.MustCompile(`js\.hcaptcha\.com`),
regexp.MustCompile(`hcaptcha`),
},
CaptchaTurnstile: {
regexp.MustCompile(`challenges\.cloudflare\.com`),
regexp.MustCompile(`js\.cloudflare\.com.*turnstile`),
},
CaptchaTypeGeetest: {
regexp.MustCompile(`geetest`),
regexp.MustCompile(`api\.geetest\.com`),
},
CaptchaTypeFriendlyCaptcha: {
regexp.MustCompile(`friendlycaptcha`),
regexp.MustCompile(`cdn\.friendlycaptcha\.com`),
},
CaptchaTypeRotateCaptcha: {
regexp.MustCompile(`api\.rotatecaptcha\.com`),
},
CaptchaTypeClickCaptcha: {
regexp.MustCompile(`assets\.clickcaptcha\.com`),
},
CaptchaTypeImageCaptcha: {
regexp.MustCompile(`api\.imagecaptcha\.com`),
},
CaptchaTypePuzzleCaptcha: {
regexp.MustCompile(`puzzle.*captcha`),
},
CaptchaTypeSliderCaptcha: {
regexp.MustCompile(`slider.*captcha`),
regexp.MustCompile(`api\.slidercaptcha\.com`),
regexp.MustCompile(`slidercaptcha\.com`),
},
CaptchaTypeDatadome: {
regexp.MustCompile(`datadome\.co`),
regexp.MustCompile(`cdn\.mxpnl\.com`),
},
CaptchaTypePerimeterX: {
regexp.MustCompile(`perimeterx\.net`),
},
CaptchaTypeArgon: {
regexp.MustCompile(`argon.*captcha`),
regexp.MustCompile(`captcha\.argon`),
},
CaptchaTypeBehaviotech: {
regexp.MustCompile(`behaviotech\.com`),
},
CaptchaTypeSmartCaptcha: {
regexp.MustCompile(`captcha\.yandex\.com`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
CaptchaTypeYandex: {
regexp.MustCompile(`yandex\.com/.*captcha`),
regexp.MustCompile(`captcha\.yandex`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
CaptchaTypeFuncaptcha: {
regexp.MustCompile(`funcaptcha\.com`),
regexp.MustCompile(`api\.funcaptcha\.com`),
},
CaptchaTypeCoingecko: {
regexp.MustCompile(`wsiz\.com`),
},
CaptchaTypeNovaScape: {
regexp.MustCompile(`novascape\.com`),
},
CaptchaTypeMCaptcha: {
regexp.MustCompile(`mcaptcha`),
regexp.MustCompile(`app\.mcaptcha\.io`),
},
CaptchaTypeKasada: {
regexp.MustCompile(`kasada`),
regexp.MustCompile(`kas\.kasadaproducts\.com`),
},
CaptchaTypeImperva: {
regexp.MustCompile(`/_Incapsula_Resource`),
regexp.MustCompile(`incapsula`),
regexp.MustCompile(`imperva`),
},
CaptchaTypeAwsWaf: {
regexp.MustCompile(`/aws-waf-captcha/`),
regexp.MustCompile(`awswaf\.com`),
regexp.MustCompile(`captcha\.aws\.amazon\.com`),
},
}
var scriptPatterns = map[CaptchaType][]*regexp.Regexp{
CaptchaTypeRecaptcha: {
regexp.MustCompile(`google\.com/recaptcha`),
regexp.MustCompile(`recaptcha.*\.js`),
regexp.MustCompile(`gstatic\.com/.*recaptcha`),
},
CaptchaTypeRecaptchaV2: {
regexp.MustCompile(`recaptcha.*v2`),
regexp.MustCompile(`recaptcha/api\.js`),
},
CaptchaTypeRecaptchaInvisible: {
regexp.MustCompile(`recaptcha.*invisible`),
regexp.MustCompile(`grecaptcha\.render.*invisible`),
},
CaptchaTypeHCaptcha: {
regexp.MustCompile(`js\.hcaptcha\.com`),
regexp.MustCompile(`hcaptcha`),
},
CaptchaTurnstile: {
regexp.MustCompile(`challenges\.cloudflare\.com`),
regexp.MustCompile(`js\.cloudflare\.com.*turnstile`),
},
CaptchaTypeGeetest: {
regexp.MustCompile(`geetest`),
regexp.MustCompile(`api\.geetest\.com`),
},
CaptchaTypeFriendlyCaptcha: {
regexp.MustCompile(`friendlycaptcha`),
regexp.MustCompile(`cdn\.friendlycaptcha\.com`),
},
CaptchaTypeRotateCaptcha: {
regexp.MustCompile(`api\.rotatecaptcha\.com`),
},
CaptchaTypeClickCaptcha: {
regexp.MustCompile(`assets\.clickcaptcha\.com`),
},
CaptchaTypeImageCaptcha: {
regexp.MustCompile(`api\.imagecaptcha\.com`),
},
CaptchaTypePuzzleCaptcha: {
regexp.MustCompile(`puzzle.*captcha`),
},
CaptchaTypeSliderCaptcha: {
regexp.MustCompile(`slider.*captcha`),
regexp.MustCompile(`api\.slidercaptcha\.com`),
regexp.MustCompile(`slidercaptcha\.com`),
},
CaptchaTypeDatadome: {
regexp.MustCompile(`datadome\.co`),
regexp.MustCompile(`cdn\.mxpnl\.com`),
},
CaptchaTypePerimeterX: {
regexp.MustCompile(`perimeterx\.net`),
},
CaptchaTypeArgon: {
regexp.MustCompile(`argon.*captcha`),
regexp.MustCompile(`captcha\.argon`),
},
CaptchaTypeBehaviotech: {
regexp.MustCompile(`behaviotech\.com`),
},
CaptchaTypeSmartCaptcha: {
regexp.MustCompile(`captcha\.yandex\.com`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
CaptchaTypeYandex: {
regexp.MustCompile(`yandex\.com/.*captcha`),
regexp.MustCompile(`captcha\.yandex`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
CaptchaTypeFuncaptcha: {
regexp.MustCompile(`funcaptcha\.com`),
regexp.MustCompile(`api\.funcaptcha\.com`),
},
CaptchaTypeCoingecko: {
regexp.MustCompile(`wsiz\.com`),
},
CaptchaTypeNovaScape: {
regexp.MustCompile(`novascape\.com`),
},
CaptchaTypeMCaptcha: {
regexp.MustCompile(`mcaptcha`),
regexp.MustCompile(`app\.mcaptcha\.io`),
},
CaptchaTypeKasada: {
regexp.MustCompile(`kasada`),
regexp.MustCompile(`kas\.kasadaproducts\.com`),
},
CaptchaTypeImperva: {
regexp.MustCompile(`/_Incapsula_Resource`),
regexp.MustCompile(`incapsula`),
regexp.MustCompile(`imperva`),
},
CaptchaTypeAwsWaf: {
regexp.MustCompile(`/aws-waf-captcha/`),
regexp.MustCompile(`awswaf\.com`),
regexp.MustCompile(`captcha\.aws\.amazon\.com`),
},
}
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 86 - 184, The map of regexes
(scriptPatterns) is being built with regexp.MustCompile inside
detectByScriptDomain, causing re-compilation on every call; move the map to a
package-level var (e.g., var scriptPatterns =
map[CaptchaType][]*regexp.Regexp{...}) so all regexp.MustCompile calls run once
at init, then update detectByScriptDomain to reference that package-level
scriptPatterns; ensure you keep the same CaptchaType keys (CaptchaTypeRecaptcha,
CaptchaTypeHCaptcha, CaptchaTurnstile, etc.) and no other logic changes.

Comment on lines +147 to +155
CaptchaTypeSmartCaptcha: {
regexp.MustCompile(`captcha\.yandex\.com`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
CaptchaTypeYandex: {
regexp.MustCompile(`yandex\.com/.*captcha`),
regexp.MustCompile(`captcha\.yandex`),
regexp.MustCompile(`smartcaptcha\.yandex`),
},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Overlapping patterns between CaptchaTypeSmartCaptcha and CaptchaTypeYandex with nondeterministic map iteration.

Both types share the pattern smartcaptcha.yandex (and captcha.yandex prefix). Since Go map iteration order is random, which type is returned for Yandex-related scripts is unpredictable. This applies broadly to all detection functions that iterate map[CaptchaType][]string — when multiple types can match, the result is nondeterministic.

Consider using an ordered slice of (CaptchaType, patterns) pairs instead of a map to enforce priority, or deduplicate the overlapping types.

🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 147 - 155, The patterns for
CaptchaTypeSmartCaptcha and CaptchaTypeYandex overlap (e.g.,
`smartcaptcha.yandex`) and because detection uses an iteration over a
map[CaptchaType][]*regexp.Regexp the returned type is nondeterministic; replace
the map with an ordered slice of pairs (e.g., a []struct{ Type CaptchaType;
Patterns []*regexp.Regexp }) or otherwise enforce an explicit priority order and
update all detection functions that currently iterate the map to iterate this
ordered slice so Yandex-related patterns are matched deterministically (or
deduplicate/adjust patterns so they no longer overlap).

Comment on lines +193 to +198
// Check for parent scripts too
form.Parents().First().Find("script").Each(func(_ int, s *goquery.Selection) {
if src, ok := s.Attr("src"); ok {
scriptSrcs = append(scriptSrcs, strings.ToLower(src))
}
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

form.Parents().First().Find("script") may search far too broadly.

This traverses up to the first parent and finds all scripts within it. Depending on document structure, this could be the <body> or even <html>, picking up scripts completely unrelated to the form. This could cause false positives (detecting a CAPTCHA loaded for a different form or purpose).

🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 193 - 198, The current traversal using
form.Parents().First().Find("script") is too broad and can capture unrelated
scripts; limit the search to the form itself and its immediate container instead
(e.g., use form.Find("script") plus form.Parent().Find("script")) so you only
collect nearby scripts into scriptSrcs, and keep using s.Attr("src") to append
lowercased sources; this reduces false positives from scripts elsewhere in the
document.

Comment on lines +269 to +307
func detectByClasses(form *goquery.Selection) CaptchaType {
classPatterns := map[CaptchaType][]string{
CaptchaTypeRecaptcha: {"g-recaptcha", "grecaptcha"},
CaptchaTypeRecaptchaV2: {"g-recaptcha-v2", "grecaptcha-v2"},
CaptchaTypeRecaptchaInvisible: {"g-recaptcha-invisible", "grecaptcha-invisible"},
CaptchaTypeHCaptcha: {"h-captcha", "hcaptcha"},
CaptchaTurnstile: {"cf-turnstile", "cloudflare-turnstile-challenge", "turnstile"},
CaptchaTypeGeetest: {"geetest_", "geetest-box"},
CaptchaTypeFriendlyCaptcha: {"frc-captcha", "friendlycaptcha"},
CaptchaTypeRotateCaptcha: {"rotate-captcha", "rotatecaptcha"},
CaptchaTypeClickCaptcha: {"click-captcha", "clickcaptcha"},
CaptchaTypeImageCaptcha: {"image-captcha", "imagecaptcha"},
CaptchaTypePuzzleCaptcha: {"puzzle-captcha", "__puzzle_captcha"},
CaptchaTypeSliderCaptcha: {"slider-captcha", "slidercaptcha", "slide-verify"},
CaptchaTypeDatadome: {"dd-challenge", "dd-top"},
CaptchaTypePerimeterX: {"_px3", "px-container"},
CaptchaTypeArgon: {"argon-captcha"},
CaptchaTypeSmartCaptcha: {"smart-captcha"},
CaptchaTypeYandex: {"smartcaptcha", "yandex-captcha"},
CaptchaTypeFuncaptcha: {"funcaptcha-container"},
CaptchaTypeMCaptcha: {"mcaptcha", "mcaptcha-container"},
CaptchaTypeKasada: {"kas", "kasada"},
CaptchaTypeImperva: {"_inc", "incapsula", "imperva"},
CaptchaTypeAwsWaf: {"aws-waf", "awswaf"},
}

html, _ := form.Html()
htmlLower := strings.ToLower(html)

for captchaType, classes := range classPatterns {
for _, class := range classes {
if strings.Contains(htmlLower, class) {
return captchaType
}
}
}

return CaptchaTypeNone
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Overly short class patterns risk false positives.

Patterns like "kas" (Line 290), "_inc" (Line 291), and "_px3" (Line 284) are very short substrings searched across the entire HTML. These can easily match unrelated content (e.g., a CSS class like "tasks" contains "kas", or "_include" contains "_inc"). Consider making these patterns more specific or matching against actual class attributes rather than raw HTML.

🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 269 - 307, The detectByClasses function
uses overly short substrings in classPatterns and searches the entire HTML
(htmlLower), causing false positives; change detection to read the form's
"class" attribute(s) and match against actual class tokens (split on whitespace)
or use stricter regex/word-boundary checks for each pattern (e.g., replace "kas"
with "kas-" or "^kas$" style matches, change "_inc" and "_px3" to more specific
tokens) inside detectByClasses so you only return a CaptchaType when an actual
class name matches the more specific pattern; update the classPatterns entries
accordingly and adjust the loop to inspect class attributes via goquery (e.g.,
form.Attr("class") or iterating child elements' class attrs) rather than
searching htmlLower.

Comment on lines +309 to +339
// detectByIframe checks for CAPTCHA-specific iframes
func detectByIframe(form *goquery.Selection) CaptchaType {
iframePatterns := map[CaptchaType][]string{
CaptchaTypeRecaptcha: {"recaptcha"},
CaptchaTypeHCaptcha: {"hcaptcha"},
CaptchaTurnstile: {"challenges.cloudflare.com"},
CaptchaTypeGeetest: {"geetest"},
CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"},
CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"},
CaptchaTypeYandex: {"yandex", "smartcaptcha"},
CaptchaTypeKasada: {"kasada", "kas"},
CaptchaTypeImperva: {"incapsula", "imperva"},
CaptchaTypeDatadome: {"datadome"},
}

html, _ := form.Html()
htmlLower := strings.ToLower(html)

// Check for iframe with CAPTCHA patterns in raw HTML
if strings.Contains(htmlLower, "iframe") {
for captchaType, patterns := range iframePatterns {
for _, pattern := range patterns {
if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") {
return captchaType
}
}
}
}

return CaptchaTypeNone
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

detectByIframe matches patterns anywhere in HTML, not within iframe elements.

Line 331 checks if htmlLower contains both "iframe" and a pattern like "recaptcha", but doesn't verify the pattern appears inside the iframe's src attribute. Any form that happens to contain an unrelated iframe and the word "recaptcha" elsewhere (e.g., in a comment or label) will be falsely detected.

Consider using goquery to select iframe elements and inspect their src attribute specifically, similar to how detectByScriptDomain inspects script[src].

♻️ Proposed fix
 func detectByIframe(form *goquery.Selection) CaptchaType {
 	iframePatterns := map[CaptchaType][]string{
 		// ...same patterns...
 	}
 
-	html, _ := form.Html()
-	htmlLower := strings.ToLower(html)
-
-	if strings.Contains(htmlLower, "iframe") {
-		for captchaType, patterns := range iframePatterns {
-			for _, pattern := range patterns {
-				if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") {
-					return captchaType
-				}
+	var iframeSrcs []string
+	form.Find("iframe").Each(func(_ int, s *goquery.Selection) {
+		if src, ok := s.Attr("src"); ok {
+			iframeSrcs = append(iframeSrcs, strings.ToLower(src))
+		}
+	})
+
+	for captchaType, patterns := range iframePatterns {
+		for _, src := range iframeSrcs {
+			for _, pattern := range patterns {
+				if strings.Contains(src, pattern) {
+					return captchaType
+				}
 			}
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// detectByIframe checks for CAPTCHA-specific iframes
func detectByIframe(form *goquery.Selection) CaptchaType {
iframePatterns := map[CaptchaType][]string{
CaptchaTypeRecaptcha: {"recaptcha"},
CaptchaTypeHCaptcha: {"hcaptcha"},
CaptchaTurnstile: {"challenges.cloudflare.com"},
CaptchaTypeGeetest: {"geetest"},
CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"},
CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"},
CaptchaTypeYandex: {"yandex", "smartcaptcha"},
CaptchaTypeKasada: {"kasada", "kas"},
CaptchaTypeImperva: {"incapsula", "imperva"},
CaptchaTypeDatadome: {"datadome"},
}
html, _ := form.Html()
htmlLower := strings.ToLower(html)
// Check for iframe with CAPTCHA patterns in raw HTML
if strings.Contains(htmlLower, "iframe") {
for captchaType, patterns := range iframePatterns {
for _, pattern := range patterns {
if strings.Contains(htmlLower, pattern) && strings.Contains(htmlLower, "iframe") {
return captchaType
}
}
}
}
return CaptchaTypeNone
}
// detectByIframe checks for CAPTCHA-specific iframes
func detectByIframe(form *goquery.Selection) CaptchaType {
iframePatterns := map[CaptchaType][]string{
CaptchaTypeRecaptcha: {"recaptcha"},
CaptchaTypeHCaptcha: {"hcaptcha"},
CaptchaTurnstile: {"challenges.cloudflare.com"},
CaptchaTypeGeetest: {"geetest"},
CaptchaTypeSliderCaptcha: {"slidercaptcha", "slide-verify"},
CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"},
CaptchaTypeYandex: {"yandex", "smartcaptcha"},
CaptchaTypeKasada: {"kasada", "kas"},
CaptchaTypeImperva: {"incapsula", "imperva"},
CaptchaTypeDatadome: {"datadome"},
}
var iframeSrcs []string
form.Find("iframe").Each(func(_ int, s *goquery.Selection) {
if src, ok := s.Attr("src"); ok {
iframeSrcs = append(iframeSrcs, strings.ToLower(src))
}
})
for captchaType, patterns := range iframePatterns {
for _, src := range iframeSrcs {
for _, pattern := range patterns {
if strings.Contains(src, pattern) {
return captchaType
}
}
}
}
return CaptchaTypeNone
}
🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 309 - 339, detectByIframe currently scans
the entire form HTML for both "iframe" and patterns, causing false positives;
change it to iterate iframe elements via form.Find("iframe") (or Selection.Each)
and inspect each iframe's src (and possibly data attributes) by lowercasing the
src and checking against iframePatterns for a match; on the first match return
the corresponding CaptchaType (otherwise return CaptchaTypeNone). Ensure you
reference detectByIframe, iframePatterns, and use form.Find("iframe")/Each and
strings.Contains(srcLower, pattern) so the detection only triggers when the
pattern is actually in an iframe's src.

Comment on lines +370 to +408
func DetectCaptchaInHTML(html string) CaptchaType {
htmlLower := strings.ToLower(html)

// Priority 1: Domain-based detection patterns (most reliable)
domainPatterns := map[CaptchaType][]string{
CaptchaTypeRecaptcha: {"google.com/recaptcha", "gstatic.com", "recaptcha"},
CaptchaTypeRecaptchaV2: {"recaptcha/api.js", "recaptcha.*v2"},
CaptchaTypeRecaptchaInvisible: {"recaptcha.*invisible"},
CaptchaTypeHCaptcha: {"hcaptcha", "js.hcaptcha.com"},
CaptchaTurnstile: {"challenges.cloudflare.com", "js.cloudflare.com"},
CaptchaTypeGeetest: {"geetest", "api.geetest.com"},
CaptchaTypeFriendlyCaptcha: {"friendlycaptcha", "cdn.friendlycaptcha.com"},
CaptchaTypeRotateCaptcha: {"rotatecaptcha", "api.rotatecaptcha.com"},
CaptchaTypeClickCaptcha: {"clickcaptcha", "assets.clickcaptcha.com"},
CaptchaTypeImageCaptcha: {"imagecaptcha", "api.imagecaptcha.com"},
CaptchaTypePuzzleCaptcha: {"puzzle-captcha", "__puzzle_captcha"},
CaptchaTypeSliderCaptcha: {"slider-captcha", "slidercaptcha"},
CaptchaTypeMCaptcha: {"mcaptcha", "app.mcaptcha.io"},
CaptchaTypeKasada: {"kasada", "kas.kasadaproducts.com"},
CaptchaTypeImperva: {"incapsula", "imperva"},
CaptchaTypeAwsWaf: {"awswaf", "captcha.aws.amazon.com"},
CaptchaTypeDatadome: {"datadome", "dd-challenge"},
CaptchaTypePerimeterX: {"perimeterx", "_pxappid"},
CaptchaTypeArgon: {"argon-captcha"},
CaptchaTypeBehaviotech: {"behaviotech"},
CaptchaTypeSmartCaptcha: {"captcha.yandex.com", "smartcaptcha"},
CaptchaTypeYandex: {"yandex.com/.*captcha", "yandex.ru/.*captcha", "smartcaptcha.yandex"},
CaptchaTypeFuncaptcha: {"funcaptcha", "arkose"},
CaptchaTypeCoingecko: {"wsiz.com"},
CaptchaTypeNovaScape: {"novascape"},
}

for captchaType, patterns := range domainPatterns {
for _, pattern := range patterns {
if strings.Contains(htmlLower, pattern) {
return captchaType
}
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Regex-like patterns used with strings.Contains — they will never match.

Several entries in domainPatterns contain regex syntax (e.g., "recaptcha.*v2", "recaptcha.*invisible", "yandex.com/.*captcha") but are matched using strings.Contains, which treats them as literal strings. These patterns will never match real HTML.

Either use regexp for matching or convert these to plain substrings (e.g., "recaptcha/api.js" is already correct as a literal, but "recaptcha.*v2" is not).

🤖 Prompt for AI Agents
In `@classifier/captcha.go` around lines 370 - 408, DetectCaptchaInHTML uses
domainPatterns with regex-like entries (e.g., "recaptcha.*v2",
"recaptcha.*invisible", "yandex.com/.*captcha") but matches them with
strings.Contains, so those will never match; fix by either (A) replacing
regex-like patterns in the domainPatterns map with actual literal substrings
that exist in the HTML (e.g., "recaptcha/api.js" or "recaptcha v2" as
appropriate) or (B) switch matching to regular expressions: import regexp,
precompile each pattern from domainPatterns and use regexp.MatchString (or
compile once per pattern) when iterating in DetectCaptchaInHTML; update the map
keys/values and matching loop accordingly so regex patterns are evaluated
correctly.

Comment on lines +67 to +72
// Check if this field is classified as captcha (before thresholding)
if _, isCaptcha := probs["captcha"]; !isCaptcha {
// Only include if NOT a captcha field
thresholdedProbs := thresholdMap(probs, threshold)
result.Fields[name] = thresholdedProbs
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Captcha field exclusion ignores probability magnitude.

A field is excluded if "captcha" appears as any key in its probability map, regardless of how low the probability is. A field with {"email": 0.95, "captcha": 0.05} would be incorrectly excluded. This should mirror the non-proba path and check whether "captcha" is the most likely classification.

🐛 Proposed fix
 		for name, probs := range fieldProba {
-			// Check if this field is classified as captcha (before thresholding)
-			if _, isCaptcha := probs["captcha"]; !isCaptcha {
-				// Only include if NOT a captcha field
+			// Skip fields where captcha is the most likely classification
+			bestClass := ""
+			bestProb := -1.0
+			for cls, p := range probs {
+				if p > bestProb {
+					bestProb = p
+					bestClass = cls
+				}
+			}
+			if bestClass != "captcha" {
 				thresholdedProbs := thresholdMap(probs, threshold)
 				result.Fields[name] = thresholdedProbs
 			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Check if this field is classified as captcha (before thresholding)
if _, isCaptcha := probs["captcha"]; !isCaptcha {
// Only include if NOT a captcha field
thresholdedProbs := thresholdMap(probs, threshold)
result.Fields[name] = thresholdedProbs
}
// Skip fields where captcha is the most likely classification
bestClass := ""
bestProb := -1.0
for cls, p := range probs {
if p > bestProb {
bestProb = p
bestClass = cls
}
}
if bestClass != "captcha" {
thresholdedProbs := thresholdMap(probs, threshold)
result.Fields[name] = thresholdedProbs
}
🤖 Prompt for AI Agents
In `@classifier/classifier.go` around lines 67 - 72, The current check excludes a
field if the "captcha" key exists in the probs map; instead determine the most
likely label and only exclude when "captcha" is the argmax. Replace the presence
check around thresholdMap(probs, threshold) with logic that iterates probs to
find the label with the highest probability (e.g., compute maxName/maxProb from
probs) and only skip adding to result.Fields[name] when maxName == "captcha";
otherwise thresholdMap(probs, threshold) and assign as before to
result.Fields[name].

Copy link
Member

@dogancanbakir dogancanbakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommending different approach

@@ -0,0 +1,509 @@
package classifier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be outside of the classifier, classifier pkg is for ML classifiers. What about a pkg under the name of captcha, outside of classifier?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is a much more professional approach. I will implement this.

Form string `json:"form"`
Fields map[string]string `json:"fields,omitempty"`
Form string `json:"form"`
Captcha string `json:"captcha,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments