Richiio/DocumentOCR

Document OCR

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    REST API Layer                           │
│                   (FastAPI Server)                          │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│              DocumentOCRPipeline (Orchestrator)             │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┼────────────┬──────────────┐
        │            │            │              │
┌───────▼──────┐ ┌──▼──────┐ ┌──▼──────┐ ┌─────▼──────┐
│ Image        │ │  OCR    │ │Metadata │ │ Confidence │
│Preprocessing │ │ Engine  │ │Extractor│ │Assessment  │
│(OpenCV)      │ │(Paddle) │ │(Regex)  │ │            │
└──────────────┘ └─────────┘ └─────────┘ └────────────┘
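
The flow in the diagram can be sketched as an orchestrator that calls the four stages in order. This is a minimal illustration, not the code in src/pipeline.py: the class name SketchPipeline, the PipelineResult fields, and the stage signatures are all assumptions, and each stage is passed in as a plain callable so the sketch runs without OpenCV or PaddleOCR installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


@dataclass
class PipelineResult:
    """Hypothetical result container; field names are illustrative."""
    text: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)
    confidence: float = 0.0


class SketchPipeline:
    """Hypothetical stand-in for DocumentOCRPipeline: runs four stages in order."""

    def __init__(
        self,
        preprocess: Callable[[Any], Any],
        ocr: Callable[[Any], Tuple[str, float]],
        extract: Callable[[str], Dict[str, str]],
        assess: Callable[[float, Dict[str, str]], float],
    ) -> None:
        self.preprocess = preprocess
        self.ocr = ocr
        self.extract = extract
        self.assess = assess

    def process(self, image: Any) -> PipelineResult:
        cleaned = self.preprocess(image)             # image preprocessing stage
        text, ocr_conf = self.ocr(cleaned)           # OCR engine stage
        metadata = self.extract(text)                # metadata extraction stage
        confidence = self.assess(ocr_conf, metadata) # confidence assessment stage
        return PipelineResult(text=text, metadata=metadata, confidence=confidence)


if __name__ == "__main__":
    # Stub stages: the "image" here is just a string standing in for pixel data.
    pipeline = SketchPipeline(
        preprocess=lambda img: img.strip(),
        ocr=lambda img: (img.upper(), 0.9),
        extract=lambda text: {"first_word": text.split()[0]},
        assess=lambda conf, meta: conf if meta else 0.0,
    )
    print(pipeline.process("  hello world  "))
```

Passing the stages in as callables keeps each one replaceable and testable on its own, which matches the separation the diagram shows.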

Technology Stack

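| Component           | Technology                                 |
|---------------------|--------------------------------------------|
| OCR Engine          | PaddleOCR (primary) + Tesseract (fallback) |
| Image Processing    | OpenCV                                     |
| Metadata Extraction | Regex + Pattern matching                   |
| Framework           | FastAPI                                    |
| Validation          | Pydantic                                   |
| Testing             | Pytest                                     |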
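
To make the "Regex + Pattern matching" row concrete, here is a small sketch of regex-based field extraction from OCR text. The field names and patterns below are hypothetical examples chosen for illustration; the patterns actually used in src/extraction/metadata_extractor.py may differ.

```python
import re
from typing import Dict

# Hypothetical field patterns; not taken from the project source.
FIELD_PATTERNS = {
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
}


def extract_fields(text: str) -> Dict[str, str]:
    """Return the first match for each known field found in the OCR text."""
    found: Dict[str, str] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


if __name__ == "__main__":
    sample = "Invoice No: INV-1042\nDate: 2024-03-15"
    print(extract_fields(sample))
```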

Installation

Prerequisites

  • Python 3.8+
  • macOS/Linux/Windows
  • 2GB RAM minimum (4GB+ recommended)

Setup

  1. Clone the repository:
git clone https://github.com/Richiio/DocumentOCR.git
cd DocumentOCR
  2. Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Install Tesseract (used as the fallback OCR engine):
# macOS
brew install tesseract

# Linux (Debian/Ubuntu)
sudo apt-get install tesseract-ocr

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
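
After installing, a quick way to confirm that the Tesseract binary is reachable is to look it up on PATH from Python. This is an optional check, not part of the project; the function name tesseract_available is hypothetical, and it assumes the executable is named tesseract (on Windows the installer may require adding its directory to PATH first).

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found; the fallback OCR engine may be unavailable")
```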

Confidence Scoring

The system provides multi-level confidence assessment:

  1. OCR Confidence: Per-text-region confidence (0.0-1.0)
  2. Field Confidence: Based on pattern matching strength
  3. Overall Confidence: Weighted combination of the OCR and field confidences
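
One way to read the three levels above is as a weighted average of the mean OCR region score and the mean field score. The sketch below assumes exactly that; the function name, the averaging, and the default weight of 0.6 are illustrative assumptions, since the README does not state the actual weights or formula.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    """Average of the values, or 0.0 when there are none."""
    return sum(values) / len(values) if values else 0.0


def overall_confidence(
    region_scores: Sequence[float],
    field_scores: Sequence[float],
    ocr_weight: float = 0.6,  # assumed weight, not taken from the project
) -> float:
    """Combine per-region OCR scores and per-field scores into one value in [0, 1]."""
    if not 0.0 <= ocr_weight <= 1.0:
        raise ValueError("ocr_weight must be between 0.0 and 1.0")
    ocr_conf = _mean(region_scores)
    field_conf = _mean(field_scores)
    return ocr_weight * ocr_conf + (1.0 - ocr_weight) * field_conf


if __name__ == "__main__":
    print(overall_confidence([0.9, 0.7], [1.0, 0.5], ocr_weight=0.5))
```

With equal weights, region scores of 0.9 and 0.7 (mean 0.8) and field scores of 1.0 and 0.5 (mean 0.75) combine to roughly 0.775.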

Error Handling

Strategy

Primary OCR (PaddleOCR)
    ↓ [Success] → Return results
    ↓ [Failure]
    ↓
Fallback OCR (Tesseract)
    ↓ [Success] → Return results
    ↓ [Failure]
    ↓
Return error with details
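
The fallback chain above can be sketched as a loop that tries each engine in order and collects failure details along the way. The engines are passed in as plain callables here so the example does not depend on the PaddleOCR or Tesseract APIs; run_with_fallback and the result keys are hypothetical names, not the interface of src/ocr/engine.py.

```python
from typing import Callable, Dict, List, Tuple

OcrFn = Callable[[bytes], str]


def run_with_fallback(image: bytes, engines: List[Tuple[str, OcrFn]]) -> Dict[str, object]:
    """Try each (name, engine) pair in order; return the first success."""
    errors: List[str] = []
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception as exc:  # any engine failure triggers the next fallback
            errors.append(f"{name}: {exc}")
            continue
        return {"success": True, "engine": name, "text": text, "errors": errors}
    # Every engine failed: return the error details instead of raising.
    return {"success": False, "engine": None, "text": "", "errors": errors}


if __name__ == "__main__":
    def failing_primary(image: bytes) -> str:
        raise RuntimeError("engine failure")

    def working_fallback(image: bytes) -> str:
        return "recognized text"

    result = run_with_fallback(
        b"fake-image-bytes",
        [("paddle", failing_primary), ("tesseract", working_fallback)],
    )
    print(result)
```

Keeping the primary engine's error in the result even when the fallback succeeds makes it easier to see how often the fallback path is taken.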

Error Types

  • File errors: Invalid path, unsupported format
  • OCR errors: Engine failure, corrupted image
  • Processing errors: Preprocessing failure
  • Extraction errors: Pattern matching issues

All errors are logged and included in results for debugging.
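
As an illustration of logging errors and carrying them in the results, the four error types can be modelled as a small exception hierarchy plus a helper that does both. The class names, the category labels, and the shape of the result dictionary are assumptions made for this sketch; the project may structure its errors differently.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger("document_ocr_sketch")


class DocumentOCRError(Exception):
    """Hypothetical base class for pipeline errors."""
    category = "unknown"


class FileError(DocumentOCRError):
    category = "file"          # invalid path, unsupported format


class OCRError(DocumentOCRError):
    category = "ocr"           # engine failure, corrupted image


class ProcessingError(DocumentOCRError):
    category = "processing"    # preprocessing failure


class ExtractionError(DocumentOCRError):
    category = "extraction"    # pattern matching issues


def record_error(result: Dict[str, Any], exc: Exception) -> None:
    """Log the error and append it to the result's 'errors' list."""
    category = getattr(exc, "category", "unknown")
    logger.error("%s error: %s", category, exc)
    result.setdefault("errors", []).append(
        {"category": category, "message": str(exc)}
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    result: Dict[str, Any] = {"text": ""}
    record_error(result, FileError("unsupported format: .xyz"))
    print(result)
```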

Directory Structure

DocumentOCR/
├── src/
│   ├── __init__.py
│   ├── pipeline.py              # Main orchestrator
│   ├── preprocessing/
│   │   └── image_processor.py   # OpenCV preprocessing
│   ├── ocr/
│   │   └── engine.py            # PaddleOCR + Tesseract
│   ├── extraction/
│   │   └── metadata_extractor.py # Regex-based extraction
│   └── api/
│       └── server.py            # FastAPI endpoints
├── tests/
│   └── test_pipeline.py         # Comprehensive tests
├── benchmarks/
│   └── benchmark_suite.py       # Performance benchmarks
├── sample_documents/            # Example documents
├── results/                     # Output directory
├── examples.py                  # Usage examples
├── requirements.txt             # Python dependencies
└── README.md                    # This file

Contributing

Improvements welcome! Key areas:

  • Add support for more document types (forms, receipts)
  • Add handwriting OCR support
  • Add multi-language benchmarking

License

Open source - use freely in commercial projects.
