esgextract-be/PHASE.md at master · Chillgorithm/esgextract-be

Develop Phase Plan: Financial Report Processing API

Objectives

Build an API that processes PDF financial reports and returns consolidated JSON, meeting the PRD’s functional, performance, and reliability criteria.
Implement idempotent, cache-aware batch processing with robust error handling.

Scope (This Phase)

End-to-end implementation of Steps 1–6 in the PRD, including directory/file validation, Gemini integrations, PDF appendix extraction, per-file JSON outputs, consolidation, and API response.
Automated tests (unit + integration + e2e with fixtures), logs/metrics, and developer ergonomics.

High-Level Architecture

API layer: A single endpoint (GET or POST) that orchestrates the processing pipeline.
Orchestrator/service: Coordinates Steps 1–6, parallelizes per-file work, and ensures completion synchronization.
Integrations:
- Gemini API: appendix range detection and field extraction.
- PDF processing library: page extraction to build appendix-only PDFs.
Storage (local FS):
- ./reports: input PDFs
- ./preprocessing: appendix-only PDFs
- ./processed: per-file JSONs and result.json
- ./config/values.json, .env (requires LAST_MODIFIED)

Milestones & Deliverables

M1: Project bootstrap
- Project skeleton, env/config loading, logging, error model, basic tests.
M2: Core pipeline
- Directory/file validation (2.2), Step 1 timestamp check, parallel fan-out (2.1.2), per-file sequential sub-steps (2.1.3), completion sync (2.1.4).
M3: Consolidation + API
- Merge per-file JSONs, add last_modified, implement endpoint response (omit LAST_MODIFIED), end-to-end tests.
M4: Hardening
- Error handling per PRD, retry/backoff for Gemini, performance tuning, observability.

Implementation Steps (Aligned to PRD)

Setup & Foundations

Choose stack: e.g., Node.js (Express/Fastify) or Python (FastAPI). Add dependency manager, scripts, lint/format/test.
Config: Load .env (must include LAST_MODIFIED); validate presence of ./config/values.json.
Dirs: Ensure creation of ./reports, ./preprocessing, ./processed, ./config on startup and/or per-request.
Logging: Structured logs (request id, file name, step, duration). Add basic metrics timers.

Step 1: Initial Validation (Cache)

Read ./processed/result.json if present.
Compare its LAST_MODIFIED with .env LAST_MODIFIED.
- If equal: short-circuit to Step 6 (serve cleaned result.json).
- If different/missing: proceed to Step 2.

Step 2: Parallel File Processing (Fan-out)

Enumerate PDFs in ./reports.
For each file, launch a task in a bounded worker pool (configurable concurrency).

Step 3: Per-file Sequential Steps (In-task)

3.1 Processed check: If ./processed/[filename].json exists, skip to 3.6.
3.2 Preprocessing check: If ./preprocessing/[filename] exists, skip to 3.5.
3.3 Appendix detection (Gemini): Ask for appendix start/end pages.
3.4 PDF extraction: Use PDF library to slice pages into ./preprocessing/[filename].
3.5 Data extraction (Gemini + values.json): Request structured JSON per schema; no citations.
3.6 Save as ./processed/[filename].json (valid JSON). Validate includes company name.

Step 4: Completion Synchronization

Await all tasks; verify expected JSON files exist for all input PDFs.

Step 5: Result Consolidation

Merge all ./processed/*.json into one object: { [companyName]: data, last_modified: <.env LAST_MODIFIED> }.
Save as ./processed/result.json.

Step 6: API Response

Read ./processed/result.json, omit LAST_MODIFIED in response body, return 200 JSON.

Error Handling (Per PRD)

File system: Missing required files/dirs, permissions, disk space ⇒ 500.
AI/Gemini: API failures, invalid PDF, extraction errors ⇒ 500. Add limited retries with backoff and clear error messages.
Data processing: JSON parse or invalid config ⇒ 500.
Ensure errors short-circuit the batch with a single 500 response.

Performance & Concurrency

Bounded parallelism for Step 2 tasks; sequential sub-steps within each task.
Reuse preprocessed and processed artifacts when present to avoid recomputation.
Avoid large in-memory buffers; stream PDFs where possible.

Configuration & Secrets

.env: LAST_MODIFIED, Gemini credentials, concurrency limits, timeouts.
./config/values.json: validated at startup; schema-checked to avoid runtime surprises.

Testing Strategy

Unit: utilities (env loader, file discovery, JSON merge, filters).
Integration: stub Gemini client; golden PDFs and expected JSONs; PDF extraction round-trip.
E2E: fixture PDFs in ./reports, run endpoint, assert result.json and response shape/idempotency.
Negative cases: missing files, bad PDFs, Gemini failures.

Observability & Ops

Structured logs with correlation ids; durations per step.
Basic metrics: files processed, successes/failures, retries, durations.

Acceptance Criteria

Meets Steps 1–6 behavior; returns 200 with consolidated JSON minus LAST_MODIFIED when successful; returns 500 on specified failures.
Idempotent with unchanged .env LAST_MODIFIED; reprocesses when changed.
Parallel processing verified; artifacts created in correct directories.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop Phase Plan: Financial Report Processing API

Objectives

Scope (This Phase)

High-Level Architecture

Milestones & Deliverables

Implementation Steps (Aligned to PRD)

Error Handling (Per PRD)

Performance & Concurrency

Configuration & Secrets

Testing Strategy

Observability & Ops

Acceptance Criteria

FilesExpand file tree

PHASE.md

Latest commit

History

PHASE.md

File metadata and controls

Develop Phase Plan: Financial Report Processing API

Objectives

Scope (This Phase)

High-Level Architecture

Milestones & Deliverables

Implementation Steps (Aligned to PRD)

Error Handling (Per PRD)

Performance & Concurrency

Configuration & Secrets

Testing Strategy

Observability & Ops

Acceptance Criteria