Develop Phase Plan: Financial Report Processing API

Objectives

  • Build an API that processes PDF financial reports and returns consolidated JSON, meeting the PRD’s functional, performance, and reliability criteria.
  • Implement idempotent, cache-aware batch processing with robust error handling.

Scope (This Phase)

  • End-to-end implementation of Steps 1–6 in the PRD, including directory/file validation, Gemini integrations, PDF appendix extraction, per-file JSON outputs, consolidation, and API response.
  • Automated tests (unit + integration + e2e with fixtures), logs/metrics, and developer ergonomics.

High-Level Architecture

  • API layer: A single endpoint (GET or POST) that orchestrates the processing pipeline.
  • Orchestrator/service: Coordinates Steps 1–6, parallelizes per-file work, and ensures completion synchronization.
  • Integrations:
    • Gemini API: appendix range detection and field extraction.
    • PDF processing library: page extraction to build appendix-only PDFs.
  • Storage (local FS):
    • ./reports: input PDFs
    • ./preprocessing: appendix-only PDFs
    • ./processed: per-file JSONs and result.json
    • ./config/values.json and .env (.env must define LAST_MODIFIED)

Milestones & Deliverables

  • M1: Project bootstrap
    • Project skeleton, env/config loading, logging, error model, basic tests.
  • M2: Core pipeline
    • Directory/file validation (2.2), Step 1 timestamp check, parallel fan-out (2.1.2), per-file sequential sub-steps (2.1.3), completion sync (2.1.4).
  • M3: Consolidation + API
    • Merge per-file JSONs, add last_modified, implement the endpoint response (with last_modified omitted), and add end-to-end tests.
  • M4: Hardening
    • Error handling per PRD, retry/backoff for Gemini, performance tuning, observability.

Implementation Steps (Aligned to PRD)

  1. Setup & Foundations
  • Choose stack: e.g., Node.js (Express/Fastify) or Python (FastAPI). Add dependency manager, scripts, lint/format/test.
  • Config: Load .env (must include LAST_MODIFIED); validate presence of ./config/values.json.
  • Dirs: Ensure creation of ./reports, ./preprocessing, ./processed, ./config on startup and/or per-request.
  • Logging: Structured logs (request id, file name, step, duration). Add basic metrics timers.
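A minimal sketch of the config and directory bootstrap described above, using only the standard library. The names `BASE_DIRS`, `parse_env`, and `load_settings` are hypothetical, and the hand-rolled `.env` parser is a stand-in for whatever loader the chosen stack provides.

```python
from pathlib import Path

# Directory names come from the plan; BASE_DIRS is an illustrative name.
BASE_DIRS = ("reports", "preprocessing", "processed", "config")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def load_settings(root: Path) -> dict[str, str]:
    """Create the working directories and return the validated .env values."""
    for name in BASE_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)

    env_file = root / ".env"
    if not env_file.is_file():
        raise FileNotFoundError(f"missing {env_file}")
    env = parse_env(env_file.read_text(encoding="utf-8"))
    if not env.get("LAST_MODIFIED"):
        raise KeyError("LAST_MODIFIED must be set in .env")

    values_file = root / "config" / "values.json"
    if not values_file.is_file():
        raise FileNotFoundError(f"missing {values_file}")
    return env
```

Failing fast here means a missing `LAST_MODIFIED` or `values.json` surfaces before any file is sent to Gemini.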
  2. Step 1: Initial Validation (Cache)
  • Read ./processed/result.json if present.
  • Compare its last_modified value with LAST_MODIFIED from .env.
    • If equal: short-circuit to Step 6 (serve result.json with last_modified removed).
    • If different/missing: proceed to Step 2.
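The cache check could look roughly like this. It assumes result.json stores the timestamp under a `last_modified` key (following the shape given in Step 5); `cache_is_fresh` is a hypothetical name. A missing file counts as stale, while a corrupt file is left to raise so the caller can map it to a 500.

```python
import json
from pathlib import Path


def cache_is_fresh(result_path: Path, env_last_modified: str) -> bool:
    """Return True when result.json exists and carries the same timestamp."""
    if not result_path.is_file():
        return False  # no cached result: proceed to Step 2
    # A JSON parse error propagates on purpose (data-processing failure).
    cached = json.loads(result_path.read_text(encoding="utf-8"))
    # Assumption: the timestamp lives under the "last_modified" key.
    return isinstance(cached, dict) and cached.get("last_modified") == env_last_modified
```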
  3. Step 2: Parallel File Processing (Fan-out)
  • Enumerate PDFs in ./reports.
  • For each file, launch a task in a bounded worker pool (configurable concurrency).
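One way to sketch the bounded fan-out is with a thread pool, which suits I/O-bound work such as API calls. `list_pdfs` and `process_all` are hypothetical names, and the default of 4 workers is an arbitrary placeholder for the configurable limit.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def list_pdfs(reports_dir: Path) -> list[Path]:
    """Enumerate PDF files in a stable, sorted order."""
    return sorted(
        p for p in reports_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_all(
    pdfs: list[Path],
    process_file: Callable[[Path], T],
    max_workers: int = 4,  # assumed default; the plan makes this configurable
) -> list[T]:
    """Run process_file over all PDFs with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps input order and re-raises a task's exception when its
        # result is consumed; leaving the with-block waits for running tasks.
        return list(pool.map(process_file, pdfs))
```

Because the first task exception propagates out of `process_all`, the caller can turn it into a single 500 response for the whole batch.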
  4. Step 3: Per-file Sequential Steps (In-task)
  • 3.1 Processed check: If ./processed/[filename].json exists, skip to 3.6.
  • 3.2 Preprocessing check: If ./preprocessing/[filename] exists, skip to 3.5.
  • 3.3 Appendix detection (Gemini): Ask for appendix start/end pages.
  • 3.4 PDF extraction: Use PDF library to slice pages into ./preprocessing/[filename].
  • 3.5 Data extraction (Gemini + values.json): Request structured JSON per schema; no citations.
  • 3.6 Save as ./processed/[filename].json (valid JSON). Validate that it includes the company name.
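The resume logic of sub-steps 3.1 to 3.6 can be sketched with the Gemini and PDF calls injected as plain callables, which also makes them easy to stub in tests. Everything named here is hypothetical: `detect_appendix`, `slice_pages`, and `extract_fields` stand in for the real integrations, `company_name` is an assumed key, and the output is assumed to be named after the PDF's stem. An existing processed file is treated as "already saved", so the function returns immediately.

```python
import json
from pathlib import Path
from typing import Callable


def process_file(
    pdf: Path,
    preprocessing_dir: Path,
    processed_dir: Path,
    detect_appendix: Callable[[Path], tuple[int, int]],   # stand-in for Gemini
    slice_pages: Callable[[Path, Path, int, int], None],  # stand-in for PDF lib
    extract_fields: Callable[[Path], dict],               # stand-in for Gemini
) -> Path:
    """Run the per-file sub-steps, reusing artifacts that already exist."""
    out_json = processed_dir / f"{pdf.stem}.json"  # assumed naming scheme
    appendix_pdf = preprocessing_dir / pdf.name

    if out_json.exists():                           # 3.1 processed check
        return out_json
    if not appendix_pdf.exists():                   # 3.2 preprocessing check
        start, end = detect_appendix(pdf)           # 3.3 appendix detection
        slice_pages(pdf, appendix_pdf, start, end)  # 3.4 PDF extraction
    data = extract_fields(appendix_pdf)             # 3.5 data extraction
    if not data.get("company_name"):                # assumed key name
        raise ValueError(f"{pdf.name}: extracted data lacks a company name")

    # 3.6 write to a temp file first so a crash never leaves partial JSON
    tmp = out_json.with_name(out_json.name + ".tmp")
    tmp.write_text(json.dumps(data, indent=2), encoding="utf-8")
    tmp.replace(out_json)
    return out_json
```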
  5. Step 4: Completion Synchronization
  • Await all tasks; verify expected JSON files exist for all input PDFs.
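After the pool has drained, the existence check might be sketched as follows. `missing_outputs` is a hypothetical helper, and it assumes each output JSON is named after its PDF's stem.

```python
from pathlib import Path


def missing_outputs(reports_dir: Path, processed_dir: Path) -> list[str]:
    """Return the stems of input PDFs that have no per-file JSON yet."""
    expected = {
        p.stem for p in reports_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    }
    present = {
        p.stem for p in processed_dir.glob("*.json")
        if p.name != "result.json"  # the consolidated file is not a per-file output
    }
    return sorted(expected - present)
```

A non-empty return value would be treated as a failure of the batch rather than silently consolidating a partial set.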
  6. Step 5: Result Consolidation
  • Merge all ./processed/*.json into one object: { [companyName]: data, last_modified: <.env LAST_MODIFIED> }.
  • Save as ./processed/result.json.
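A sketch of the merge, assuming each per-file JSON exposes the company name under a `company_name` key (a hypothetical key; the PRD schema decides the real one). `consolidate` is an illustrative name, and rejecting duplicate company names is a design choice, not something the plan specifies.

```python
import json
from pathlib import Path


def consolidate(processed_dir: Path, last_modified: str) -> dict:
    """Merge per-file JSONs into result.json, keyed by company name."""
    result: dict = {}
    for path in sorted(processed_dir.glob("*.json")):
        if path.name == "result.json":
            continue  # never merge a previous consolidated result
        data = json.loads(path.read_text(encoding="utf-8"))
        company = data["company_name"]  # assumed key name
        if company in result:
            raise ValueError(f"duplicate company name: {company}")
        result[company] = data
    result["last_modified"] = last_modified
    (processed_dir / "result.json").write_text(
        json.dumps(result, indent=2), encoding="utf-8"
    )
    return result
```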
  7. Step 6: API Response
  • Read ./processed/result.json, omit last_modified from the response body, and return 200 with JSON.
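Stripping the timestamp before responding can be as small as the sketch below. `build_response_body` is a hypothetical name; the key is matched case-insensitively because this plan refers to it both as `last_modified` and `LAST_MODIFIED`.

```python
import json
from pathlib import Path


def build_response_body(result_path: Path) -> dict:
    """Load result.json and drop the cache timestamp from the payload."""
    result = json.loads(result_path.read_text(encoding="utf-8"))
    # Case-insensitive match is an assumption to cover either spelling.
    return {k: v for k, v in result.items() if k.lower() != "last_modified"}
```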

Error Handling (Per PRD)

  • File system: Missing required files/dirs, permissions, disk space ⇒ 500.
  • AI/Gemini: API failures, invalid PDF, extraction errors ⇒ 500. Add limited retries with backoff and clear error messages.
  • Data processing: JSON parse or invalid config ⇒ 500.
  • Ensure errors short-circuit the batch with a single 500 response.
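The "limited retries with backoff" for Gemini calls might be sketched like this. `TransientError` and `with_retries` are hypothetical; the real client raises its own exception types, which would need to be mapped onto the retryable case. The attempt count and base delay are placeholder values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,        # assumed limit
    base_delay: float = 0.5,  # assumed seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call`, retrying TransientError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: let the caller return a 500
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Non-transient exceptions pass straight through, so an invalid PDF fails the batch on the first attempt instead of being retried.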

Performance & Concurrency

  • Bounded parallelism for Step 2 tasks; sequential sub-steps within each task.
  • Reuse preprocessed and processed artifacts when present to avoid recomputation.
  • Avoid large in-memory buffers; stream PDFs where possible.

Configuration & Secrets

  • .env: LAST_MODIFIED, Gemini credentials, concurrency limits, timeouts.
  • ./config/values.json: validated at startup; schema-checked to avoid runtime surprises.

Testing Strategy

  • Unit: utilities (env loader, file discovery, JSON merge, filters).
  • Integration: stub Gemini client; golden PDFs and expected JSONs; PDF extraction round-trip.
  • E2E: fixture PDFs in ./reports, run endpoint, assert result.json and response shape/idempotency.
  • Negative cases: missing files, bad PDFs, Gemini failures.

Observability & Ops

  • Structured logs with correlation ids; durations per step.
  • Basic metrics: files processed, successes/failures, retries, durations.
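Per-step durations and correlation ids can be captured with a small context manager, sketched here with the standard `logging` module. The logger name, the `timed_step` helper, and the log field names are all illustrative.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("report_pipeline")  # illustrative logger name


@contextmanager
def timed_step(request_id: str, file_name: str, step: str):
    """Log one JSON line per step with its outcome and duration."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        logger.info(json.dumps({
            "request_id": request_id,
            "file": file_name,
            "step": step,
            "status": status,
            "duration_ms": round((time.perf_counter() - start) * 1000, 2),
        }))
```

Usage would look like `with timed_step(request_id, pdf.name, "3.3"): ...` around each sub-step, which gives one log line per file and step.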

Acceptance Criteria

  • Meets the behavior of Steps 1–6; returns 200 with the consolidated JSON minus last_modified on success; returns 500 on the specified failures.
  • Idempotent while LAST_MODIFIED in .env is unchanged; reprocesses when it changes.
  • Parallel processing verified; artifacts created in correct directories.