
Add geothermal electricity extraction support #400

Open
bpulluta wants to merge 12 commits into main from feature/geothermal-extraction-pr


@bpulluta
Collaborator

Overview

This PR adds geothermal electricity as a supported extraction technology in COMPASS. It includes the extraction schema and plugin configuration needed to discover, retrieve, and extract structured ordinance data from jurisdictions governing utility-scale geothermal electricity generation.

Two bugs in the retrieval layer were discovered and fixed during development. Both affected all technologies, not just geothermal.


New: Geothermal Electricity Extraction

Files added:

  • compass/extraction/geothermal_electricity/geothermal_schema.json — defines 29 extractable features including setbacks, permitting, noise limits, zoning classifications, decommissioning, and drilling requirements
  • compass/extraction/geothermal_electricity/geothermal_plugin_config.yaml — configures search queries, website scoring keywords, heuristic filters, and document collection behavior tuned for geothermal electricity ordinances

The schema follows the standard COMPASS one-shot extraction format and is compatible with the existing compass process pipeline with no code changes required.


Bug Fix 1 — PDF URLs with spaces failed to download

crawl4ai can return document URLs with raw spaces in the path (e.g. a county storing files under a folder named Land Use Code/):

# broken — HTTP request fails silently
https://countygov.org/Land Use Code/53007.pdf

# fixed — percent-encoded, downloads correctly
https://countygov.org/Land%20Use%20Code/53007.pdf

_sanitize_doc_sources() now percent-encodes raw spaces in each document's source attribute before download_jurisdiction_ordinances_from_website() returns. It relies only on urllib.parse from the Python standard library, so no new dependencies are added.

File: compass/scripts/download.py
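
For readers who want to see the idea in isolation, here is a minimal sketch of space-safe URL encoding with urllib.parse. The function name encode_spaces_in_url and the safe-character set are illustrative assumptions, not the code in this PR; the sketch only shows one way to encode the path without double-encoding sequences that are already percent-encoded.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in the path. "%" is included so that
# already-encoded sequences such as "%20" are not encoded a second time.
# (This set is an assumption chosen for the sketch.)
_PATH_SAFE = "/:@!$&'()*+,;=%~"


def encode_spaces_in_url(url: str) -> str:
    """Percent-encode raw spaces in the path of a URL.

    Hypothetical helper for illustration only; it is not the
    _sanitize_doc_sources() implementation from this PR.
    """
    parts = urlsplit(url)
    encoded_path = quote(parts.path, safe=_PATH_SAFE)
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


raw = "https://countygov.org/Land Use Code/53007.pdf"
fixed = encode_spaces_in_url(raw)
print(fixed)
```

Because "%" is treated as safe, calling the helper on an already-encoded URL returns it unchanged, which makes it safe to apply to every source regardless of where it came from.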


Bug Fix 2 — Anchor text was never used in link scoring

When the COMPASS crawler parses a page, it reads each link's visible label and stores it in the title field. The upstream scorer (ELMLinkScorer) reads the text key instead. Because the two never matched, anchor text always scored zero regardless of keyword weights, and only the URL filename contributed to the score.

# before — anchor text stored in title, scorer reads text (always empty)
_Link(title=title, href=..., base_domain=...)

# after — text field populated, scorer sees the anchor label
_Link(title=title, text=title, href=..., base_domain=...)

File: compass/web/website_crawl.py
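
The effect of the field mismatch can be reproduced with a toy scorer. Everything in this sketch is assumed for illustration: toy_score, the keyword weights, and the example.gov URL are hypothetical, and the real ELMLinkScorer logic may differ. The only behavior taken from the description above is that the scorer reads the text key and the URL, and never looks at title.

```python
def toy_score(link: dict, keyword_weights: dict) -> int:
    """Sum keyword weights found in the link's text and href.

    Toy stand-in for a link scorer (hypothetical); it deliberately
    ignores the "title" key, mirroring the mismatch described above.
    """
    text = link.get("text", "").lower()
    href = link.get("href", "").lower()
    text_score = sum(w for kw, w in keyword_weights.items() if kw in text)
    href_score = sum(w for kw, w in keyword_weights.items() if kw in href)
    return text_score + href_score


# Hypothetical weights and link, for illustration only.
weights = {"geothermal": 5, "ordinance": 3}

# Before: the anchor label is stored only under "title".
before = {
    "title": "Geothermal Ordinance",
    "href": "https://example.gov/docs/53007.pdf",
}

# After: the same label is also stored under "text".
after = {**before, "text": before["title"]}

print(toy_score(before, weights))  # 0: label is invisible to the scorer
print(toy_score(after, weights))   # 8: both keywords match the label
```

A link whose filename is an opaque number, as in this example, is exactly the case where URL-only scoring misses a relevant document.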


Tests

Two regression tests added to tests/python/unit/web/test_web_crawl.py:

  • test_extract_links_from_html_sets_text_from_anchor — verifies anchor text populates both title and text on the link object
  • test_compass_link_scorer_scores_anchor_text — verifies the scorer uses text when assigning keyword scores

All 30 unit tests pass.

bpulluta and others added 10 commits March 17, 2026 12:11
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…rmatting (#399)

* Initial plan

* Fix all review comments in skills documentation

Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com>
…eval

- percent-encode raw spaces in crawl4ai PDF source URLs before downstream use
- populate link text field from anchor text so ELMLinkScorer can score link labels
- add two regression tests covering both fixes
Copilot AI review requested due to automatic review settings March 20, 2026 23:49


bpulluta and others added 2 commits March 20, 2026 23:06
… test (#401)

* Initial plan

* Extract shared _sanitize_url to url_utils.py, simplify to space-only encoding, fix test robustness

Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com>
Agent-Logs-Url: https://github.com/NatLabRockies/COMPASS/sessions/ceb782b4-c312-41d1-b4eb-eccbbef67097

* fix failing test

* ruff error fix

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com>