# Bug Report

## Summary
`AdaptiveCrawler.digest()` crawls external domains even though `_crawl_with_preview()` sets `include_external=False` in `LinkPreviewConfig`. The `include_external` flag controls which links are *scored*, but `digest()` then unconditionally adds both internal and external links to `state.pending_links`, bypassing the filter entirely.
## Version
- crawl4ai: 0.8.0
- Python: 3.12
## Root Cause

### The misleading flag

`_crawl_with_preview()` correctly sets `include_external=False`:
```python
# adaptive_crawler.py, line ~1458
async def _crawl_with_preview(self, url: str, query: str) -> Optional[CrawlResult]:
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_external=False,  # ← intended to restrict crawl to same domain
            ...
        ),
        ...
    )
```

However, `LinkPreviewConfig.include_external` only controls link *scoring* (BM25/embedding relevance), not what gets added to the crawl queue.
### The bug: both passes unconditionally extend `pending_links`
**First crawl** (initial page, lines ~1354–1360):

```python
if isinstance(result.links, dict):
    internal_links = [Link(**link) for link in result.links.get('internal', [])]
    external_links = [Link(**link) for link in result.links.get('external', [])]
    self.state.pending_links.extend(internal_links + external_links)  # ← external included!
else:
    self.state.pending_links.extend(result.links.internal + result.links.external)  # ← same
```

**Subsequent crawls** (main loop, lines ~1408–1419):
```python
if isinstance(result.links, dict):
    internal_links = [Link(**link_data) for link_data in result.links.get('internal', [])]
    external_links = [Link(**link_data) for link_data in result.links.get('external', [])]
    new_links = internal_links + external_links  # ← external included!
else:
    new_links = result.links.internal + result.links.external  # ← same

for new_link in new_links:
    if new_link.href not in self.state.crawled_urls:
        self.state.pending_links.append(new_link)
```

In both cases, `external_links` (links pointing to entirely different domains) end up in `pending_links` and are subsequently crawled.
## Observed Behavior

When crawling `https://newsetiquettes.fr/`, `AdaptiveCrawler` followed links to:
```
[FETCH] https://haroldparis.fr/
[FETCH] https://earthmoon.fr/
[FETCH] https://go.haroldparis.fr/rapiiide
[FETCH] https://go.haroldparis.fr/newsletter
[FETCH] https://cookiedatabase.org/tcf/purposes
[FETCH] https://go.haroldparis.fr/deviantart
[FETCH] https://go.haroldparis.fr/behance
[FETCH] https://www.deviantart.com/haroldparis/about
[FETCH] https://www.behance.net/haroldparis/info
[FETCH] https://www.deviantart.com/community
[FETCH] https://www.deviantartsupport.com/en
...
```
These are all external domains found in links on the target site. The `max_pages=15` budget was consumed by crawling third-party domains instead of the target site's own pages.
## Expected Behavior

When `include_external=False` is set in `LinkPreviewConfig` (or equivalent), `digest()` should restrict `pending_links` to URLs belonging to the same root domain as `start_url` (allowing subdomains, e.g. `blog.example.com` for `example.com`).
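The expected filter amounts to a same-root-domain predicate. A minimal sketch of that check (`is_same_root_domain` is a hypothetical helper, not part of crawl4ai):

```python
from urllib.parse import urlparse


def is_same_root_domain(url: str, start_url: str) -> bool:
    """Naive check: does `url` share the registrable root domain of `start_url`?

    Keeps the last two host labels, so subdomains match; a real implementation
    should use Public Suffix List data to handle eTLDs like `co.uk` correctly.
    """
    def root(netloc: str) -> str:
        parts = netloc.lower().split(".")
        return ".".join(parts[-2:]) if len(parts) >= 2 else netloc

    return root(urlparse(url).netloc) == root(urlparse(start_url).netloc)
```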
## Workaround

We worked around this by wrapping the strategy with a custom class that filters `pending_links` after each `update_state` call:
```python
from urllib.parse import urlparse

from crawl4ai.adaptive_crawler import CrawlState, CrawlStrategy, AdaptiveConfig


def _root_domain(netloc: str) -> str:
    parts = netloc.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else netloc


class DomainBoundStrategy(CrawlStrategy):
    """Filters pending_links to the target domain after each update_state."""

    def __init__(self, strategy: CrawlStrategy, target_domain: str) -> None:
        self._strategy = strategy
        self._target_domain = target_domain

    def __getattr__(self, name: str) -> object:
        return getattr(self._strategy, name)

    async def calculate_confidence(self, state: CrawlState) -> float:
        return await self._strategy.calculate_confidence(state)

    async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> list[tuple]:
        return await self._strategy.rank_links(state, config)

    async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool:
        return await self._strategy.should_stop(state, config)

    async def update_state(self, state: CrawlState, new_results: list) -> None:
        await self._strategy.update_state(state, new_results)
        # Purge external links that digest() unconditionally added to pending_links
        state.pending_links = [
            link for link in state.pending_links
            if _root_domain(urlparse(link.href or "").netloc) == self._target_domain
        ]


# Usage:
target_domain = _root_domain(urlparse(start_url).netloc)
adaptive_crawler.strategy = DomainBoundStrategy(adaptive_crawler.strategy, target_domain)
```

This works, but it's fragile (it relies on `update_state` being called immediately after `pending_links` is extended) and requires users to subclass internal abstractions.
## Suggested Fix

In `digest()`, filter `pending_links` based on whether `include_external` is `False`. The fix could take any of three forms:

**Option A — filter in `digest()` directly:**
```python
# After: self.state.pending_links.extend(internal_links + external_links)
# Add:
if not link_preview_config.include_external:
    self.state.pending_links = [
        link for link in self.state.pending_links
        if not is_external_url(link.href, base_domain)
    ]
```

**Option B — expose `allowed_domains` in `AdaptiveConfig`:**
Add an `allowed_domains: list[str] | None = None` parameter to `AdaptiveConfig` and filter `pending_links` accordingly in `digest()`. This would be more explicit and flexible (e.g. allowing cross-domain crawls when explicitly requested).
**Option C — default to domain-restricted crawling:**
Change the default behavior so that `digest()` restricts crawling to the root domain of `start_url` unless explicitly configured otherwise (e.g. `allow_external_domains=True`).
## Additional Notes

- `DomainFilter` exists in `deep_crawling/filters.py` but is not wired into `AdaptiveCrawler`.
- `is_external_url()` already exists in `utils.py` and handles subdomain detection correctly; it could be reused directly.
- The bug wastes the `max_pages` budget on irrelevant external domains, significantly degrading the quality of adaptive crawling for sites with many outbound links.