- Build an API that processes PDF financial reports and returns consolidated JSON, meeting the PRD’s functional, performance, and reliability criteria.
- Implement idempotent, cache-aware batch processing with robust error handling.
- End-to-end implementation of Steps 1–6 in the PRD, including directory/file validation, Gemini integrations, PDF appendix extraction, per-file JSON outputs, consolidation, and API response.
- Automated tests (unit + integration + e2e with fixtures), logs/metrics, and developer ergonomics.
- API layer: A single endpoint (GET or POST) that orchestrates the processing pipeline.
- Orchestrator/service: Coordinates Steps 1–6, parallelizes per-file work, and ensures completion synchronization.
- Integrations:
- Gemini API: appendix range detection and field extraction.
- PDF processing library: page extraction to build appendix-only PDFs.
- Storage (local FS):
  - `./reports`: input PDFs
  - `./preprocessing`: appendix-only PDFs
  - `./processed`: per-file JSONs and `result.json`
  - `./config/values.json`, `.env` (requires `LAST_MODIFIED`)
- M1: Project bootstrap
- Project skeleton, env/config loading, logging, error model, basic tests.
- M2: Core pipeline
- Directory/file validation (2.2), Step 1 timestamp check, parallel fan-out (2.1.2), per-file sequential sub-steps (2.1.3), completion sync (2.1.4).
- M3: Consolidation + API
- Merge per-file JSONs, add `last_modified`, implement endpoint response (omit `LAST_MODIFIED`), end-to-end tests.
- M4: Hardening
- Error handling per PRD, retry/backoff for Gemini, performance tuning, observability.
- Setup & Foundations
- Choose stack: e.g., Node.js (Express/Fastify) or Python (FastAPI). Add dependency manager, scripts, lint/format/test.
- Config: Load `.env` (must include `LAST_MODIFIED`); validate presence of `./config/values.json`.
- Dirs: Ensure creation of `./reports`, `./preprocessing`, `./processed`, `./config` on startup and/or per-request.
- Logging: Structured logs (request id, file name, step, duration). Add basic metrics timers.
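
The directory and config checks above could be sketched as follows. This is a minimal illustration using only the standard library; the function names and the `REQUIRED_DIRS` constant are hypothetical, and only the directory and file names come from the plan.

```python
from pathlib import Path

# Directory names come from the plan; the constant and function names are illustrative.
REQUIRED_DIRS = ("reports", "preprocessing", "processed", "config")


def ensure_directories(root: Path) -> None:
    """Create the working directories if they are missing (safe to call repeatedly)."""
    for name in REQUIRED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)


def require_values_file(root: Path) -> Path:
    """Return the path to config/values.json, or fail fast if it is absent."""
    values_path = root / "config" / "values.json"
    if not values_path.is_file():
        raise FileNotFoundError(f"missing required config file: {values_path}")
    return values_path
```

Because `ensure_directories` is idempotent, it can run both at startup and at the start of each request without extra guards.
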
- Step 1: Initial Validation (Cache)
- Read `./processed/result.json` if present.
- Compare its `LAST_MODIFIED` with `.env` `LAST_MODIFIED`.
  - If equal: short-circuit to Step 6 (serve cleaned `result.json`).
  - If different/missing: proceed to Step 2.
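
A possible shape for the Step 1 cache check is shown below. The plan spells the stored field both as `LAST_MODIFIED` and `last_modified`; this sketch assumes the lower-case key used in the Step 5 result shape, and `load_cached_result` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Optional


def load_cached_result(result_path: Path, env_last_modified: str) -> Optional[dict]:
    """Return the cached result if its timestamp matches the .env value, else None."""
    if not result_path.is_file():
        return None  # nothing cached yet: proceed to Step 2
    with result_path.open(encoding="utf-8") as fh:
        cached = json.load(fh)  # a corrupt file raises and surfaces as an error
    # Assumption: the timestamp is stored under the "last_modified" key.
    if isinstance(cached, dict) and cached.get("last_modified") == env_last_modified:
        return cached  # cache hit: short-circuit to Step 6
    return None  # stale cache: proceed to Step 2
```
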
- Step 2: Parallel File Processing (Fan-out)
- Enumerate PDFs in `./reports`.
- For each file, launch a task in a bounded worker pool (configurable concurrency).
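
The fan-out could look roughly like this, assuming a Python stack and a thread pool (threads are a reasonable fit if most of the per-file time is spent waiting on Gemini calls, which is an assumption here). `list_reports`, `process_all`, and the `process_file` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def list_reports(reports_dir: Path) -> List[Path]:
    """Enumerate input PDFs in a stable (sorted) order."""
    return sorted(
        p for p in reports_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_all(
    pdfs: List[Path], process_file: Callable[[Path], T], max_workers: int = 4
) -> Dict[str, T]:
    """Run process_file for each PDF in a bounded pool and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pdf: pool.submit(process_file, pdf) for pdf in pdfs}
        # result() blocks until the task finishes and re-raises its exception,
        # so the first failing file (in submission order) aborts the batch.
        return {pdf.name: fut.result() for pdf, fut in futures.items()}
```

`max_workers` is the configurable concurrency limit; leaving the `with` block waits for every submitted task, which also covers the completion synchronization in Step 4.
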
- Step 3: Per-file Sequential Steps (In-task)
- 3.1 Processed check: If `./processed/[filename].json` exists, skip to 3.6.
- 3.2 Preprocessing check: If `./preprocessing/[filename]` exists, skip to 3.5.
- 3.3 Appendix detection (Gemini): Ask for appendix start/end pages.
- 3.4 PDF extraction: Use PDF library to slice pages into `./preprocessing/[filename]`.
- 3.5 Data extraction (Gemini + `values.json`): Request structured JSON per schema; no citations.
- 3.6 Save as `./processed/[filename].json` (valid JSON). Validate includes company name.
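
The skip logic in 3.1 to 3.6 can be sketched as a single function. The Gemini calls and the PDF slicing are passed in as callables, so no particular client or PDF library is implied. Two details are assumptions of this sketch, not the PRD: `[filename]` in the JSON path is taken to be the PDF's stem, and the company name is expected under a `company_name` key.

```python
import json
from pathlib import Path
from typing import Callable, Tuple


def process_file(
    pdf: Path,
    preprocessing_dir: Path,
    processed_dir: Path,
    detect_appendix: Callable[[Path], Tuple[int, int]],
    slice_pdf: Callable[[Path, int, int, Path], None],
    extract_data: Callable[[Path], dict],
) -> Path:
    """Run sub-steps 3.1 to 3.6 for one PDF, reusing artifacts that already exist."""
    appendix_path = preprocessing_dir / pdf.name
    json_path = processed_dir / f"{pdf.stem}.json"  # assumption: [filename] is the stem

    if not json_path.exists():  # 3.1: reuse a previously saved JSON
        if not appendix_path.exists():  # 3.2: reuse a previously sliced appendix
            start, end = detect_appendix(pdf)  # 3.3
            slice_pdf(pdf, start, end, appendix_path)  # 3.4
        data = extract_data(appendix_path)  # 3.5
        # 3.6: write under a temporary name, then rename, so an interrupted
        # write cannot leave a partial file that later looks "processed".
        tmp_path = json_path.with_name(json_path.name + ".tmp")
        tmp_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
        tmp_path.replace(json_path)

    # 3.6: the saved file must parse and carry a company name (key name assumed).
    saved = json.loads(json_path.read_text(encoding="utf-8"))
    if not isinstance(saved, dict) or not saved.get("company_name"):
        raise ValueError(f"{json_path.name}: missing company name")
    return json_path
```
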
- Step 4: Completion Synchronization
- Await all tasks; verify expected JSON files exist for all input PDFs.
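
One way to verify completion after all tasks have returned is sketched below. The helper names are hypothetical, and the `<stem>.json` naming is an assumption about how `[filename]` maps to the output file.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(pdfs: Iterable[Path], processed_dir: Path) -> List[str]:
    """Return the names of input PDFs that have no per-file JSON yet."""
    return [
        pdf.name for pdf in pdfs if not (processed_dir / f"{pdf.stem}.json").is_file()
    ]


def assert_complete(pdfs: Iterable[Path], processed_dir: Path) -> None:
    """Raise if any expected per-file JSON is missing, so Step 5 never runs on partial data."""
    missing = missing_outputs(pdfs, processed_dir)
    if missing:
        raise RuntimeError(f"missing processed JSON for: {', '.join(missing)}")
```
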
- Step 5: Result Consolidation
- Merge all `./processed/*.json` into one object: `{ [companyName]: data, last_modified: <.env LAST_MODIFIED> }`.
- Save as `./processed/result.json`.
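
A minimal sketch of the merge, assuming each per-file JSON exposes the company name under a `company_name` key (that key and the function name `consolidate` are assumptions; the output shape follows the plan).

```python
import json
from pathlib import Path


def consolidate(processed_dir: Path, env_last_modified: str) -> dict:
    """Merge per-file JSONs into {company: data, last_modified: ...} and save result.json."""
    result: dict = {}
    for path in sorted(processed_dir.glob("*.json")):
        if path.name == "result.json":
            continue  # never merge a previous consolidated result into itself
        data = json.loads(path.read_text(encoding="utf-8"))
        company = data["company_name"]  # assumption: key name is not fixed by the plan
        if company in result or company == "last_modified":
            raise ValueError(f"{path.name}: conflicting company key {company!r}")
        result[company] = data
    result["last_modified"] = env_last_modified

    # Write under a temporary name first so readers never see a half-written file.
    result_path = processed_dir / "result.json"
    tmp_path = processed_dir / "result.json.tmp"
    tmp_path.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp_path.replace(result_path)
    return result
```

Rejecting duplicate company names keeps one file from silently overwriting another; whether that should be an error or a merge is a decision the PRD would need to settle.
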
- Step 6: API Response
- Read `./processed/result.json`, omit `LAST_MODIFIED` in response body, return 200 JSON.
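
Stripping the timestamp before responding might look like this. Since the plan spells the field both ways, the sketch removes either spelling; `build_response_body` is a hypothetical helper that would be called from whichever web framework is chosen.

```python
# Both spellings appear in the plan, so the sketch drops either one.
HIDDEN_KEYS = frozenset({"last_modified", "LAST_MODIFIED"})


def build_response_body(result: dict) -> dict:
    """Return a copy of the consolidated result without the cache timestamp."""
    return {key: value for key, value in result.items() if key not in HIDDEN_KEYS}
```

Returning a copy leaves the cached object untouched, so the same in-memory result can still be compared against `.env` on a later request.
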
- File system: Missing required files/dirs, permissions, disk space ⇒ 500.
- AI/Gemini: API failures, invalid PDF, extraction errors ⇒ 500. Add limited retries with backoff and clear error messages.
- Data processing: JSON parse or invalid config ⇒ 500.
- Ensure errors short-circuit the batch with a single 500 response.
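
The limited retries with backoff mentioned for Gemini calls could be implemented along these lines. This is a generic sketch: the attempt count, the delays, and the `with_retries` name are illustrative, and which exception types are worth retrying depends on the client library in use.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts):
        try:
            return call()
        except retry_on:
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ... by default
    # Final attempt: any exception propagates to the caller, where the API
    # layer can translate it into the single 500 response.
    return call()
```

The `sleep` parameter exists so tests can inject a fake and avoid real delays.
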
- Bounded parallelism for Step 2 tasks; sequential sub-steps within each task.
- Reuse preprocessed and processed artifacts when present to avoid recomputation.
- Avoid large in-memory buffers; stream PDFs where possible.
- `.env`: `LAST_MODIFIED`, Gemini credentials, concurrency limits, timeouts.
- `./config/values.json`: validated at startup; schema-checked to avoid runtime surprises.
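
A small settings loader illustrates the startup validation. Only `LAST_MODIFIED` is named in the plan; `GEMINI_API_KEY`, `MAX_CONCURRENCY`, and `REQUEST_TIMEOUT_S` are hypothetical variable names, and the sketch assumes the `.env` file has already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    last_modified: str
    gemini_api_key: str
    max_concurrency: int
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate required variables once at startup and apply defaults to the rest."""
    # Only LAST_MODIFIED is named in the plan; the other variable names are assumed.
    missing = [key for key in ("LAST_MODIFIED", "GEMINI_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")
    max_concurrency = int(env.get("MAX_CONCURRENCY", "4"))
    if max_concurrency < 1:
        raise RuntimeError("MAX_CONCURRENCY must be at least 1")
    return Settings(
        last_modified=env["LAST_MODIFIED"],
        gemini_api_key=env["GEMINI_API_KEY"],
        max_concurrency=max_concurrency,
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "60")),
    )
```
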
- Unit: utilities (env loader, file discovery, JSON merge, filters).
- Integration: stub Gemini client; golden PDFs and expected JSONs; PDF extraction round-trip.
- E2E: fixture PDFs in `./reports`, run endpoint, assert `result.json` and response shape/idempotency.
- Negative cases: missing files, bad PDFs, Gemini failures.
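
For the integration and negative cases, a stubbed Gemini client might look like the following. The class and its method names (`detect_appendix`, `extract_fields`) are hypothetical and would need to mirror whatever interface the real client wrapper exposes.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


class StubGeminiClient:
    """Test double: returns canned answers, records calls, and can simulate failures."""

    def __init__(
        self,
        appendix_ranges: Dict[str, Tuple[int, int]],
        extractions: Dict[str, dict],
        fail_on: Iterable[str] = (),
    ) -> None:
        self._ranges = appendix_ranges
        self._extractions = extractions
        self._fail_on = set(fail_on)
        self.calls: List[Tuple[str, str]] = []

    def _record(self, method: str, pdf: Path) -> None:
        self.calls.append((method, pdf.name))
        if pdf.name in self._fail_on:
            raise RuntimeError(f"simulated Gemini failure for {pdf.name}")

    def detect_appendix(self, pdf: Path) -> Tuple[int, int]:
        self._record("detect_appendix", pdf)
        return self._ranges[pdf.name]

    def extract_fields(self, appendix_pdf: Path) -> dict:
        self._record("extract_fields", appendix_pdf)
        return self._extractions[appendix_pdf.name]
```

An idempotency test can run the pipeline twice against the same fixtures and assert that `calls` did not grow on the second run.
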
- Structured logs with correlation ids; durations per step.
- Basic metrics: files processed, successes/failures, retries, durations.
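
Per-step timing and counters can be combined in one context manager. This sketch uses only the standard library; the logger name, the in-process `Counter`, and the field names in the log record are illustrative choices, not requirements from the PRD.

```python
import json
import logging
import time
from collections import Counter
from contextlib import contextmanager
from typing import Iterator

logger = logging.getLogger("pipeline")
metrics: Counter = Counter()  # in-process counters; a real metrics backend could replace this


@contextmanager
def timed_step(request_id: str, file_name: str, step: str) -> Iterator[None]:
    """Log one structured record per step with its duration and outcome."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise  # the error still propagates; only the outcome is recorded here
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 1)
        metrics[f"{step}.{status}"] += 1
        logger.info(json.dumps({
            "request_id": request_id,
            "file": file_name,
            "step": step,
            "status": status,
            "duration_ms": duration_ms,
        }))
```

Wrapping each sub-step, for example `with timed_step(request_id, pdf.name, "extract"): ...`, yields one log line per step keyed by the correlation id.
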
- Meets Steps 1–6 behavior; returns 200 with consolidated JSON minus `LAST_MODIFIED` when successful; returns 500 on specified failures.
- Idempotent with unchanged `.env` `LAST_MODIFIED`; reprocesses when changed.
- Parallel processing verified; artifacts created in correct directories.