# Bug Report

## Summary
`AdaptiveCrawler.digest()` crawls external domains even though `_crawl_with_preview()` sets `include_external=False` in `LinkPreviewConfig`. The `include_external` flag controls which links are *scored*, but `digest()` then unconditionally adds both internal and external links to `state.pending_links`, bypassing the filter entirely.
## Version
- crawl4ai: 0.8.0
- Python: 3.12
## Root Cause

### The misleading flag

`_crawl_with_preview()` correctly sets `include_external=False`:
```python
# adaptive_crawler.py, line ~1458
async def _crawl_with_preview(self, url: str, query: str) -> Optional[CrawlResult]:
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_external=False,  # ← intended to restrict crawl to same domain
            ...
        ),
        ...
    )
```

However, `LinkPreviewConfig.include_external` only controls link *scoring* (BM25/embedding relevance), not what gets added to the crawl queue.
### The bug: both passes unconditionally extend `pending_links`
**First crawl** (initial page, lines ~1354–1360):

```python
if isinstance(result.links, dict):
    internal_links = [Link(**link) for link in result.links.get('internal', [])]
    external_links = [Link(**link) for link in result.links.get('external', [])]
    self.state.pending_links.extend(internal_links + external_links)  # ← external included!
else:
    self.state.pending_links.extend(result.links.internal + result.links.external)  # ← same
```

**Subsequent crawls** (main loop, lines ~1408–1419):
```python
if isinstance(result.links, dict):
    internal_links = [Link(**link_data) for link_data in result.links.get('internal', [])]
    external_links = [Link(**link_data) for link_data in result.links.get('external', [])]
    new_links = internal_links + external_links  # ← external included!
else:
    new_links = result.links.internal + result.links.external  # ← same

for new_link in new_links:
    if new_link.href not in self.state.crawled_urls:
        self.state.pending_links.append(new_link)
```

In both cases, `external_links` (links pointing to entirely different domains) end up in `pending_links` and are subsequently crawled.
## Observed Behavior

When crawling `https://newsetiquettes.fr/`, `AdaptiveCrawler` followed links to:
```
[FETCH] https://haroldparis.fr/
[FETCH] https://earthmoon.fr/
[FETCH] https://go.haroldparis.fr/rapiiide
[FETCH] https://go.haroldparis.fr/newsletter
[FETCH] https://cookiedatabase.org/tcf/purposes
[FETCH] https://go.haroldparis.fr/deviantart
[FETCH] https://go.haroldparis.fr/behance
[FETCH] https://www.deviantart.com/haroldparis/about
[FETCH] https://www.behance.net/haroldparis/info
[FETCH] https://www.deviantart.com/community
[FETCH] https://www.deviantartsupport.com/en
...
```
These are all external domains found in links on the target site. The `max_pages=15` budget was consumed by crawling third-party domains instead of the target site's own pages.
## Expected Behavior

When `include_external=False` is set in `LinkPreviewConfig` (or equivalent), `digest()` should restrict `pending_links` to URLs belonging to the same root domain as `start_url` (allowing subdomains, e.g. `blog.example.com` for `example.com`).
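The expected filter amounts to a same-root-domain predicate. A minimal sketch of that check (`is_same_root_domain` is a hypothetical helper, not part of crawl4ai):

```python
from urllib.parse import urlparse


def is_same_root_domain(url: str, start_url: str) -> bool:
    """Naive check: does `url` share the registrable root domain of `start_url`?

    Keeps the last two host labels, so subdomains match; a real implementation
    should use Public Suffix List data to handle eTLDs like `co.uk` correctly.
    """
    def root(netloc: str) -> str:
        parts = netloc.lower().split(".")
        return ".".join(parts[-2:]) if len(parts) >= 2 else netloc

    return root(urlparse(url).netloc) == root(urlparse(start_url).netloc)
```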
## Workaround

We worked around this by wrapping the strategy with a custom class that filters `pending_links` after each `update_state` call:
```python
from urllib.parse import urlparse

from crawl4ai.adaptive_crawler import CrawlState, CrawlStrategy, AdaptiveConfig


def _root_domain(netloc: str) -> str:
    parts = netloc.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else netloc


class DomainBoundStrategy(CrawlStrategy):
    """Filters pending_links to the target domain after each update_state."""

    def __init__(self, strategy: CrawlStrategy, target_domain: str) -> None:
        self._strategy = strategy
        self._target_domain = target_domain

    def __getattr__(self, name: str) -> object:
        return getattr(self._strategy, name)

    async def calculate_confidence(self, state: CrawlState) -> float:
        return await self._strategy.calculate_confidence(state)

    async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> list[tuple]:
        return await self._strategy.rank_links(state, config)

    async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool:
        return await self._strategy.should_stop(state, config)

    async def update_state(self, state: CrawlState, new_results: list) -> None:
        await self._strategy.update_state(state, new_results)
        # Purge external links that digest() unconditionally added to pending_links
        state.pending_links = [
            link for link in state.pending_links
            if _root_domain(urlparse(link.href or "").netloc) == self._target_domain
        ]


# Usage:
target_domain = _root_domain(urlparse(start_url).netloc)
adaptive_crawler.strategy = DomainBoundStrategy(adaptive_crawler.strategy, target_domain)
```

This works, but it's fragile (it relies on `update_state` being called immediately after `pending_links` is extended) and requires users to subclass internal abstractions.
## Suggested Fix

In `digest()`, filter `pending_links` based on whether `include_external` is `False`. The fix could take any of three forms:

**Option A — filter in `digest()` directly:**
```python
# After: self.state.pending_links.extend(internal_links + external_links)
# Add:
if not link_preview_config.include_external:
    self.state.pending_links = [
        link for link in self.state.pending_links
        if not is_external_url(link.href, base_domain)
    ]
```

**Option B — expose `allowed_domains` in `AdaptiveConfig`:**
Add an `allowed_domains: list[str] | None = None` parameter to `AdaptiveConfig` and filter `pending_links` accordingly in `digest()`. This would be more explicit and flexible (e.g. allowing cross-domain crawls when explicitly requested).
**Option C — default to domain-restricted crawling:**
Change the default behavior so that `digest()` restricts crawling to the root domain of `start_url` unless explicitly configured otherwise (e.g. `allow_external_domains=True`).
## Additional Notes

- `DomainFilter` exists in `deep_crawling/filters.py` but is not wired into `AdaptiveCrawler`.
- `is_external_url()` already exists in `utils.py` and handles subdomain detection correctly; it could be reused directly.
- The bug wastes the `max_pages` budget on irrelevant external domains, significantly degrading the quality of adaptive crawling for sites with many outbound links.