┌─────────────────────────────────────────────────────────────┐
│                       REST API Layer                        │
│                      (FastAPI Server)                       │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│             DocumentOCRPipeline (Orchestrator)              │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┬──────────────┐
        │            │            │              │
┌───────▼──────┐ ┌───▼─────┐ ┌────▼────┐ ┌───────▼────┐
│ Image        │ │ OCR     │ │Metadata │ │ Confidence │
│Preprocessing │ │ Engine  │ │Extractor│ │Assessment  │
│(OpenCV)      │ │(Paddle) │ │(Regex)  │ │            │
└──────────────┘ └─────────┘ └─────────┘ └────────────┘
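To make the diagram concrete, here is a minimal sketch of how an orchestrator could chain the four components. The class name `PipelineSketch`, the stage signatures, and the strictly sequential order are assumptions made for illustration; the actual `DocumentOCRPipeline` in `src/pipeline.py` may be structured differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stage signatures; the real components in src/ may differ.
Region = Tuple[str, float]  # (recognized text, confidence in 0.0-1.0)
Preprocess = Callable[[bytes], bytes]
Recognize = Callable[[bytes], List[Region]]
Extract = Callable[[str], Dict[str, str]]
Assess = Callable[[List[float], Dict[str, str]], float]


@dataclass
class PipelineSketch:
    """Runs the four stages from the diagram one after another."""

    preprocess: Preprocess
    recognize: Recognize
    extract: Extract
    assess: Assess

    def process(self, image: bytes) -> Dict[str, Any]:
        cleaned = self.preprocess(image)            # image preprocessing
        regions = self.recognize(cleaned)           # OCR engine
        text = "\n".join(t for t, _ in regions)
        metadata = self.extract(text)               # metadata extractor
        confidence = self.assess([c for _, c in regions], metadata)
        return {"text": text, "metadata": metadata, "confidence": confidence}


# Stub stages stand in for OpenCV, PaddleOCR, and the regex extractor.
pipeline = PipelineSketch(
    preprocess=lambda img: img,
    recognize=lambda img: [("Invoice 42", 0.9)],
    extract=lambda text: {"title": text.split()[0]},
    assess=lambda scores, meta: min(scores) if scores else 0.0,
)
result = pipeline.process(b"fake-image-bytes")
```

Passing the stages in as callables keeps each one replaceable in tests, which is one plausible reason to route everything through a single orchestrator.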
| Component | Technology |
|---|---|
| OCR Engine | PaddleOCR (primary) + Tesseract (fallback) |
| Image Processing | OpenCV |
| Metadata Extraction | Regex + Pattern matching |
| Framework | FastAPI |
| Validation | Pydantic |
| Testing | Pytest |
- Python 3.8+
- macOS/Linux/Windows
- 2GB RAM minimum (4GB+ recommended)
- Change into the cloned repository (adjust the path for your machine):
    cd /Users/richiio/Desktop/DocumentOCR
- Create virtual environment:
    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Install Tesseract:
    # macOS
    brew install tesseract
    # Linux (Debian/Ubuntu)
    sudo apt-get install tesseract-ocr
    # Windows
    # Download from: https://github.com/UB-Mannheim/tesseract/wiki

The system provides multi-level confidence assessment:
- OCR Confidence: Per-text-region confidence (0.0-1.0)
- Field Confidence: Based on pattern matching strength
- Overall Confidence: Weighted combination of the OCR and field confidences
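The weighted combination could look roughly like the sketch below. The function name `overall_confidence`, the use of a plain mean for each level, and the 0.6/0.4 split are illustrative assumptions, not the project's actual formula.

```python
from statistics import mean
from typing import Dict, List


def overall_confidence(
    region_scores: List[float],
    field_scores: Dict[str, float],
    ocr_weight: float = 0.6,  # assumed weight; the remainder goes to field confidence
) -> float:
    """Blend mean OCR confidence with mean field confidence into one score."""
    if not 0.0 <= ocr_weight <= 1.0:
        raise ValueError("ocr_weight must be between 0.0 and 1.0")
    # An empty level contributes 0.0 rather than raising.
    ocr = mean(region_scores) if region_scores else 0.0
    fields = mean(list(field_scores.values())) if field_scores else 0.0
    return ocr_weight * ocr + (1.0 - ocr_weight) * fields


# Two text regions and two extracted fields.
score = overall_confidence([0.9, 0.8], {"date": 1.0, "title": 0.5})
```

With these inputs the OCR mean is 0.85 and the field mean is 0.75, so the blended score is about 0.81.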
Primary OCR (PaddleOCR)
↓ [Success] → Return results
↓ [Failure]
↓
Fallback OCR (Tesseract)
↓ [Success] → Return results
↓ [Failure]
↓
Return error with details
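The fallback chain above can be sketched as a loop over engines in priority order. Everything named here (`run_with_fallback`, `OCRFailure`, the stub engines) is hypothetical, and real PaddleOCR or Tesseract calls are replaced by plain functions so the example stays self-contained.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger("ocr_fallback")

OCRResult = List[Tuple[str, float]]  # (text, confidence) per region
Engine = Tuple[str, Callable[[bytes], OCRResult]]


class OCRFailure(RuntimeError):
    """Raised when every engine in the chain has failed."""


def run_with_fallback(image: bytes, engines: Sequence[Engine]) -> Tuple[str, OCRResult]:
    """Try each engine in order; return the first success with the engine's name."""
    errors: List[str] = []
    for name, recognize in engines:
        try:
            return name, recognize(image)
        except Exception as exc:  # any engine error triggers the next fallback
            logger.warning("%s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
    # No engine succeeded: report every failure, not just the last one.
    raise OCRFailure("; ".join(errors) or "no OCR engines configured")


# Stub engines: the primary fails, the fallback succeeds.
def broken_primary(image: bytes) -> OCRResult:
    raise RuntimeError("model not loaded")


def working_fallback(image: bytes) -> OCRResult:
    return [("hello", 0.7)]


used, regions = run_with_fallback(
    b"img", [("paddleocr", broken_primary), ("tesseract", working_fallback)]
)
```

This sketch treats only raised exceptions as failure. Whether an empty or low-confidence result should also trigger the fallback is a design decision the flow above leaves open.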
- File errors: Invalid path, unsupported format
- OCR errors: Engine failure, corrupted image
- Processing errors: Preprocessing failure
- Extraction errors: Pattern matching issues
All errors are logged and included in results for debugging.
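One way to carry the four error categories inside a result object is sketched below. The names `ErrorCategory`, `StageError`, `ResultSketch`, and `validate_input`, as well as the set of supported extensions, are assumptions for illustration and are not taken from the project's code.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import List, Optional

logger = logging.getLogger("document_ocr")


class ErrorCategory(str, Enum):
    FILE = "file"
    OCR = "ocr"
    PROCESSING = "processing"
    EXTRACTION = "extraction"


@dataclass
class StageError:
    category: ErrorCategory
    message: str


@dataclass
class ResultSketch:
    text: Optional[str] = None
    errors: List[StageError] = field(default_factory=list)

    def add_error(self, category: ErrorCategory, message: str) -> None:
        """Log the error and keep it on the result for later debugging."""
        logger.error("[%s] %s", category.value, message)
        self.errors.append(StageError(category, message))


# Assumed list of formats; the real pipeline may accept others.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".pdf"}


def validate_input(path: str, result: ResultSketch) -> bool:
    """Record a file error and return False if the input cannot be processed."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        result.add_error(ErrorCategory.FILE, f"unsupported format: {p.suffix or '(none)'}")
        return False
    if not p.is_file():
        result.add_error(ErrorCategory.FILE, f"invalid path: {path}")
        return False
    return True


result = ResultSketch()
ok = validate_input("scan.docx", result)
```

Collecting errors on the result instead of raising lets a caller see every problem from a run in one place, at the cost of having to check `result.errors` explicitly.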
DocumentOCR/
├── src/
│   ├── __init__.py
│   ├── pipeline.py                  # Main orchestrator
│   ├── preprocessing/
│   │   └── image_processor.py       # OpenCV preprocessing
│   ├── ocr/
│   │   └── engine.py                # PaddleOCR + Tesseract
│   ├── extraction/
│   │   └── metadata_extractor.py    # Regex-based extraction
│   └── api/
│       └── server.py                # FastAPI endpoints
├── tests/
│   └── test_pipeline.py             # Comprehensive tests
├── benchmarks/
│   └── benchmark_suite.py           # Performance benchmarks
├── sample_documents/                # Example documents
├── results/                         # Output directory
├── examples.py                      # Usage examples
├── requirements.txt                 # Python dependencies
└── README.md                        # This file
Improvements welcome! Key areas:
- Add support for more document types (forms, receipts)
- Add handwriting OCR support
- Add multi-language benchmarking
Open source - use freely in commercial projects.