AdaptiveCrawler.digest() ignores include_external=False: external domains added to pending_links unconditionally #1776

@haroldparis

Description

Bug Report

Summary

AdaptiveCrawler.digest() crawls external domains even though _crawl_with_preview() sets include_external=False in LinkPreviewConfig. The include_external flag controls which links are scored, but digest() then unconditionally adds both internal and external links to state.pending_links, bypassing the filter entirely.

Version

  • crawl4ai: 0.8.0
  • Python: 3.12

Root Cause

The misleading flag

_crawl_with_preview() correctly sets include_external=False:

# adaptive_crawler.py, line ~1458
async def _crawl_with_preview(self, url: str, query: str) -> Optional[CrawlResult]:
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_external=False,  # ← intended to restrict crawl to same domain
            ...
        ),
        ...
    )

However, LinkPreviewConfig.include_external only controls link scoring (BM25/embedding), not what gets added to the crawl queue.

The bug: both passes unconditionally extend pending_links

First crawl (initial page, lines ~1354–1360):

if isinstance(result.links, dict):
    internal_links = [Link(**link) for link in result.links.get('internal', [])]
    external_links = [Link(**link) for link in result.links.get('external', [])]
    self.state.pending_links.extend(internal_links + external_links)  # ← external included!
else:
    self.state.pending_links.extend(result.links.internal + result.links.external)  # ← same

Subsequent crawls (main loop, lines ~1408–1419):

if isinstance(result.links, dict):
    internal_links = [Link(**link_data) for link_data in result.links.get('internal', [])]
    external_links = [Link(**link_data) for link_data in result.links.get('external', [])]
    new_links = internal_links + external_links  # ← external included!
else:
    new_links = result.links.internal + result.links.external  # ← same

for new_link in new_links:
    if new_link.href not in self.state.crawled_urls:
        self.state.pending_links.append(new_link)

In both cases, external_links — links pointing to entirely different domains — end up in pending_links and are subsequently crawled.


Observed Behavior

When crawling https://newsetiquettes.fr/, AdaptiveCrawler followed links to:

[FETCH] https://haroldparis.fr/
[FETCH] https://earthmoon.fr/
[FETCH] https://go.haroldparis.fr/rapiiide
[FETCH] https://go.haroldparis.fr/newsletter
[FETCH] https://cookiedatabase.org/tcf/purposes
[FETCH] https://go.haroldparis.fr/deviantart
[FETCH] https://go.haroldparis.fr/behance
[FETCH] https://www.deviantart.com/haroldparis/about
[FETCH] https://www.behance.net/haroldparis/info
[FETCH] https://www.deviantart.com/community
[FETCH] https://www.deviantartsupport.com/en
...

These are all external domains found in links on the target site. The max_pages=15 budget was consumed by crawling third-party domains instead of the target site's own pages.


Expected Behavior

When include_external=False is set in LinkPreviewConfig (or equivalent), digest() should restrict pending_links to URLs belonging to the same root domain as start_url (allowing subdomains, e.g. blog.example.com for example.com).
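A subdomain-aware membership check of the kind described above could look like the following (illustrative sketch only; the function name is hypothetical, not part of crawl4ai):

```python
from urllib.parse import urlparse

def belongs_to_root_domain(url: str, root_domain: str) -> bool:
    """True if the URL's host is root_domain itself or any subdomain of it."""
    host = urlparse(url).netloc.lower()
    root = root_domain.lower()
    # endswith("." + root) accepts blog.example.com for example.com,
    # but rejects lookalikes such as evil-example.com
    return host == root or host.endswith("." + root)
```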


Workaround

We worked around this by wrapping the strategy with a custom class that filters pending_links after each update_state call:

from urllib.parse import urlparse
from crawl4ai.adaptive_crawler import CrawlState, CrawlStrategy, AdaptiveConfig

def _root_domain(netloc: str) -> str:
    # Naive eTLD+1: keeps only the last two labels, so multi-part TLDs
    # (e.g. example.co.uk) resolve to "co.uk"; acceptable for this workaround.
    parts = netloc.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else netloc

class DomainBoundStrategy(CrawlStrategy):
    """Filters pending_links to the target domain after each update_state."""

    def __init__(self, strategy: CrawlStrategy, target_domain: str) -> None:
        self._strategy = strategy
        self._target_domain = target_domain

    def __getattr__(self, name: str) -> object:
        return getattr(self._strategy, name)

    async def calculate_confidence(self, state: CrawlState) -> float:
        return await self._strategy.calculate_confidence(state)

    async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> list[tuple]:
        return await self._strategy.rank_links(state, config)

    async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool:
        return await self._strategy.should_stop(state, config)

    async def update_state(self, state: CrawlState, new_results: list) -> None:
        await self._strategy.update_state(state, new_results)
        # Purge external links that digest() unconditionally added to pending_links
        state.pending_links = [
            link for link in state.pending_links
            if _root_domain(urlparse(link.href or "").netloc) == self._target_domain
        ]

# Usage:
target_domain = _root_domain(urlparse(start_url).netloc)
adaptive_crawler.strategy = DomainBoundStrategy(adaptive_crawler.strategy, target_domain)

This works, but it's fragile (relies on update_state being called immediately after pending_links is extended) and requires users to subclass internal abstractions.


Suggested Fix

In digest(), filter pending_links based on whether include_external is False. The fix could be applied in either location:

Option A — filter in digest() directly:

# After: self.state.pending_links.extend(internal_links + external_links)
# Add:
if not link_preview_config.include_external:
    self.state.pending_links = [
        link for link in self.state.pending_links
        if not is_external_url(link.href, base_domain)
    ]

Option B — expose allowed_domains in AdaptiveConfig:

Add an allowed_domains: list[str] | None = None parameter to AdaptiveConfig and filter pending_links accordingly in digest(). This would be more explicit and flexible (e.g. allowing cross-domain crawls when explicitly requested).

Option C — default to domain-restricted crawling:

Change the default behavior so that digest() restricts crawling to the root domain of start_url unless explicitly configured otherwise (e.g. allow_external_domains=True).


Additional Notes

  • DomainFilter exists in deep_crawling/filters.py but is not wired into AdaptiveCrawler.
  • is_external_url() already exists in utils.py and handles subdomain detection correctly — it could be reused directly.
  • The bug wastes the max_pages budget on irrelevant external domains, significantly degrading the quality of adaptive crawling for sites with many outbound links.

Labels: 🐞 Bug (Something isn't working), 📌 Root caused (identified the root cause of bug)