Add geothermal electricity extraction support #400
Open
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…rmatting (#399) * Initial plan * Fix all review comments in skills documentation Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com>
…eval - percent-encode raw spaces in crawl4ai PDF source URLs before downstream use - populate link text field from anchor text so ELMLinkScorer can score link labels - add two regression tests covering both fixes
… test (#401) * Initial plan * Extract shared _sanitize_url to url_utils.py, simplify to space-only encoding, fix test robustness Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com> Agent-Logs-Url: https://github.com/NatLabRockies/COMPASS/sessions/ceb782b4-c312-41d1-b4eb-eccbbef67097 * fix failing test * ruff error fix --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com>
Overview
This PR adds geothermal electricity as a supported extraction technology in COMPASS. It includes the extraction schema and plugin configuration needed to discover, retrieve, and extract structured ordinance data from jurisdictions governing utility-scale geothermal electricity generation.
Two bugs in the retrieval layer were discovered and fixed during development. Both affected all technologies, not just geothermal.
New: Geothermal Electricity Extraction
Files added:
- compass/extraction/geothermal_electricity/geothermal_schema.json: defines 29 extractable features, including setbacks, permitting, noise limits, zoning classifications, decommissioning, and drilling requirements
- compass/extraction/geothermal_electricity/geothermal_plugin_config.yaml: configures search queries, website scoring keywords, heuristic filters, and document collection behavior tuned for geothermal electricity ordinances
The schema follows the standard COMPASS one-shot extraction format and is compatible with the existing compass process pipeline with no code changes required.
Bug Fix 1: PDF URLs with spaces failed to download
crawl4ai can return document URLs with raw spaces in the path (e.g. a county storing files under a folder named Land Use Code/). _sanitize_doc_sources() now percent-encodes any source attribute that contains raw spaces before download_jurisdiction_ordinances_from_website() returns. It uses urllib.parse from the Python standard library, so no new dependencies are needed.
File: compass/scripts/download.py
Bug Fix 2: Anchor text was never used in link scoring
When the COMPASS crawler parses a page, it reads each link's visible label. Previously it stored that label only in the title field, while the upstream scorer (ELMLinkScorer) reads the text key. The two never matched, so anchor text always scored zero regardless of keyword weights, and only the URL filename was used. The crawler now populates text from the anchor text as well.
File: compass/web/website_crawl.py
Tests
Two regression tests added to tests/python/unit/web/test_web_crawl.py:
- test_extract_links_from_html_sets_text_from_anchor: verifies anchor text populates both title and text on the link object
- test_compass_link_scorer_scores_anchor_text: verifies the scorer uses text when assigning keyword scores
All 30 unit tests pass.
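For reviewers who want a quick mental model of the two retrieval fixes described above, here is a minimal sketch. It is not the code in this PR: sanitize_url, build_link, and score_link are hypothetical stand-ins, the link is modeled as a plain dict, and the scorer is a toy keyword matcher rather than ELMLinkScorer. The only behavior it assumes from the description is that raw spaces get percent-encoded and that the scorer looks up the text key.

```python
from urllib.parse import urlsplit, urlunsplit


def sanitize_url(url: str) -> str:
    """Percent-encode raw spaces in the path and query of a URL.

    Hypothetical helper for illustration only. It touches spaces and
    nothing else, so existing escapes such as "%20" are left unchanged.
    """
    parts = urlsplit(url)
    path = parts.path.replace(" ", "%20")
    query = parts.query.replace(" ", "%20")
    return urlunsplit(parts._replace(path=path, query=query))


def build_link(href: str, anchor_text: str) -> dict:
    """Build a link record that carries the anchor text under both keys.

    Assumption: the link is a plain dict. Setting "text" as well as
    "title" is the part that mirrors the fix described above.
    """
    label = anchor_text.strip()
    return {"href": href, "title": label, "text": label}


def score_link(link: dict, keyword_weights: dict) -> int:
    """Toy scorer that, like the description says of the real one, reads "text"."""
    text = link.get("text", "").lower()
    return sum(
        weight for keyword, weight in keyword_weights.items() if keyword in text
    )


if __name__ == "__main__":
    weights = {"geothermal": 2, "ordinance": 1}  # illustrative weights only

    print(sanitize_url("https://example.gov/Land Use Code/zoning.pdf"))

    fixed = build_link("https://example.gov/docs/file123.pdf", "Geothermal Ordinance")
    old_style = {"href": fixed["href"], "title": fixed["title"]}  # no "text" key

    print(score_link(fixed, weights), score_link(old_style, weights))
```

The old_style record shows why the mismatch was silent: a scorer that reads a missing key sees an empty string and returns zero without raising.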