Document OCR

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    REST API Layer                           │
│                   (FastAPI Server)                          │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│              DocumentOCRPipeline (Orchestrator)             │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┬──────────────┐
        │            │            │              │
┌───────▼──────┐ ┌──▼──────┐ ┌──▼──────┐ ┌─────▼──────┐
│ Image        │ │  OCR    │ │Metadata │ │ Confidence │
│Preprocessing │ │ Engine  │ │Extractor│ │Assessment  │
│(OpenCV)      │ │(Paddle) │ │(Regex)  │ │            │
└──────────────┘ └─────────┘ └─────────┘ └────────────┘
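
The flow in the diagram can be sketched as a small orchestrator that calls each stage in turn. This is an illustrative sketch only, not the project's actual `DocumentOCRPipeline`: the `PipelineSketch` class, the stage signatures, and the result keys are hypothetical names chosen for this example, and the stages are toy stand-ins for OpenCV, PaddleOCR, and the regex extractor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: one OCR region is (text, confidence in [0.0, 1.0]).
Region = Tuple[str, float]


@dataclass
class PipelineSketch:
    """Illustrative orchestrator; the stage callables are stand-ins, not the real modules."""

    preprocess: Callable[[bytes], bytes]
    run_ocr: Callable[[bytes], List[Region]]
    extract_metadata: Callable[[str], Dict[str, str]]
    assess_confidence: Callable[[List[Region]], float]

    def process(self, image: bytes) -> dict:
        # 1. Clean up the image, 2. run OCR, 3. extract fields, 4. score the result.
        cleaned = self.preprocess(image)
        regions = self.run_ocr(cleaned)
        text = "\n".join(region_text for region_text, _ in regions)
        return {
            "text": text,
            "metadata": self.extract_metadata(text),
            "confidence": self.assess_confidence(regions),
        }


# Toy stages so the sketch runs without any OCR library installed.
pipeline = PipelineSketch(
    preprocess=lambda img: img,
    run_ocr=lambda img: [("Invoice No: 42", 0.9), ("Total: 10.00", 0.7)],
    extract_metadata=lambda text: {"lines": str(len(text.splitlines()))},
    assess_confidence=lambda regions: sum(c for _, c in regions) / len(regions),
)
result = pipeline.process(b"fake-image-bytes")
print(result["text"])
```

Passing the stages in as callables keeps the orchestrator independent of any particular OCR engine, which is one plausible way to make the primary/fallback swap easy to test.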

Technology Stack

Component	Technology
OCR Engine	PaddleOCR (primary) + Tesseract (fallback)
Image Processing	OpenCV
Metadata Extraction	Regex + Pattern matching
Framework	FastAPI
Validation	Pydantic
Testing	Pytest

Installation

Prerequisites

Python 3.8+
macOS/Linux/Windows
2GB RAM minimum (4GB+ recommended)

Setup

Clone the repository, then change into the project directory (the path below is an example; substitute your own):

cd /Users/richiio/Desktop/DocumentOCR

Create virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Install Tesseract:

# macOS
brew install tesseract

# Linux
sudo apt-get install tesseract-ocr

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
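
After installing, it can help to confirm that the `tesseract` executable is visible on your `PATH`. The check below is a small sketch using only the Python standard library; the helper name `find_tesseract` is hypothetical and not part of this project.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; check the installation step for your platform")
else:
    print(f"tesseract found at {path}")
```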

Confidence Scoring

The system provides multi-level confidence assessment:

OCR Confidence: Per-text-region confidence (0.0-1.0)
Field Confidence: Based on pattern matching strength
Overall Confidence: Weighted combination of the OCR and field confidences
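
As a rough illustration of how a weighted combination might work, the sketch below averages the per-region OCR scores and the per-field scores, then blends the two. The weights (0.6 and 0.4), the use of a plain mean, and the function names are assumptions made for this example; the README does not state the actual formula.

```python
from typing import Dict, List

# Hypothetical weights; the actual values used by the project are not documented here.
OCR_WEIGHT = 0.6
FIELD_WEIGHT = 0.4


def mean(values: List[float]) -> float:
    """Arithmetic mean, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def overall_confidence(region_scores: List[float], field_scores: Dict[str, float]) -> float:
    """Blend mean OCR confidence and mean field confidence into one score in [0.0, 1.0]."""
    ocr = mean(region_scores)
    fields = mean(list(field_scores.values()))
    score = OCR_WEIGHT * ocr + FIELD_WEIGHT * fields
    return max(0.0, min(1.0, score))  # clamp in case inputs stray outside [0, 1]


score = overall_confidence([0.9, 0.8], {"invoice_number": 1.0, "date": 0.5})
print(f"overall confidence: {score:.2f}")
```

With these assumed weights, the OCR mean of 0.85 and the field mean of 0.75 combine to 0.6 × 0.85 + 0.4 × 0.75 = 0.81.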

Error Handling

Strategy

Primary OCR (PaddleOCR)
    ↓ [Success] → Return results
    ↓ [Failure]
    ↓
Fallback OCR (Tesseract)
    ↓ [Success] → Return results
    ↓ [Failure]
    ↓
Return error with details
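
The fallback strategy above can be sketched as a loop over engines in priority order. This is a minimal illustration, not the code in `src/ocr/engine.py`: the `run_with_fallback` function, the engine signature, and the result keys are hypothetical, and the two engines are toy functions standing in for PaddleOCR and Tesseract.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("ocr_fallback_sketch")

# Hypothetical engine signature: takes an image path, returns the extracted text.
Engine = Callable[[str], str]


def run_with_fallback(image_path: str, engines: List[Tuple[str, Engine]]) -> Dict[str, object]:
    """Try each engine in order; return the first success, or every error if all fail."""
    errors: List[str] = []
    for name, engine in engines:
        try:
            text = engine(image_path)
        except Exception as exc:  # broad on purpose: any engine failure triggers the fallback
            logger.warning("%s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
            continue
        return {"success": True, "engine": name, "text": text, "errors": errors}
    return {"success": False, "engine": None, "text": "", "errors": errors}


def failing_primary(path: str) -> str:
    """Toy stand-in for a primary engine that fails."""
    raise RuntimeError("model not loaded")


def working_fallback(path: str) -> str:
    """Toy stand-in for a fallback engine that succeeds."""
    return "hello"


result = run_with_fallback(
    "scan.png",
    [("paddleocr", failing_primary), ("tesseract", working_fallback)],
)
print(result["engine"], result["errors"])
```

Keeping the primary engine's error in the result even when the fallback succeeds makes it possible to see afterwards that a fallback happened.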

Error Types

File errors: Invalid path, unsupported format
OCR errors: Engine failure, corrupted image
Processing errors: Preprocessing failure
Extraction errors: Pattern matching issues

All errors are logged and included in results for debugging.
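
One way to log errors and also carry them in the result is to record each error through a single method that does both. The sketch below assumes a hypothetical `ErrorKind` enum mirroring the four categories above and a hypothetical `ResultSketch` container; the project's real result schema may differ.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

logger = logging.getLogger("ocr_errors_sketch")


class ErrorKind(Enum):
    """Hypothetical categories mirroring the error types listed above."""

    FILE = "file"
    OCR = "ocr"
    PROCESSING = "processing"
    EXTRACTION = "extraction"


@dataclass
class ResultSketch:
    """Illustrative result container that accumulates errors alongside the text."""

    text: str = ""
    errors: List[Dict[str, str]] = field(default_factory=list)

    def record(self, kind: ErrorKind, message: str) -> None:
        # Log the error and keep a copy in the result for later debugging.
        logger.error("[%s] %s", kind.value, message)
        self.errors.append({"type": kind.value, "message": message})


result = ResultSketch()
result.record(ErrorKind.FILE, "unsupported format: .xyz")
result.record(ErrorKind.EXTRACTION, "no date pattern matched")
print(result.errors)
```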

Directory Structure

DocumentOCR/
├── src/
│   ├── __init__.py
│   ├── pipeline.py              # Main orchestrator
│   ├── preprocessing/
│   │   └── image_processor.py   # OpenCV preprocessing
│   ├── ocr/
│   │   └── engine.py            # PaddleOCR + Tesseract
│   ├── extraction/
│   │   └── metadata_extractor.py # Regex-based extraction
│   └── api/
│       └── server.py            # FastAPI endpoints
├── tests/
│   └── test_pipeline.py         # Comprehensive tests
├── benchmarks/
│   └── benchmark_suite.py       # Performance benchmarks
├── sample_documents/            # Example documents
├── results/                     # Output directory
├── examples.py                  # Usage examples
├── requirements.txt             # Python dependencies
└── README.md                    # This file

Contributing

Improvements welcome! Key areas:

Add support for more document types (forms, receipts)
Add handwriting OCR support
Multi-language benchmarking

License

Open source - use freely in commercial projects.
