Skip to content

unicef/hope-documents

Repository files navigation

Hope Documents

Test Lint codecov Documentation Pypi

Hope Documents is a Django-based application for automated document scanning and text analysis. Its primary purpose is to validate document number consistency, ensuring that identifiers (such as IDs, registration numbers, or codes) detected in document images match the expected reference values.

The system processes uploaded images, extracts textual information using OCR, and searches for the provided number or ID within the document’s contents. This enables reliable, automated verification of official documents such as ID cards, invoices, and certificates.

Features

  • OCR Processing: Extracts text from various image formats and PDF files.
  • Command-Line Interface: Easy-to-use CLI for batch processing of documents.
  • Configurable OCR Engine: Options to fine-tune Tesseract and OpenCV parameters for better results.
  • Django Integration: Seamlessly integrates with Django projects.

Installation

To install Hope Documents, you can use pip:

pip install hope-documents

Usage

Hope Documents provides a command-line interface to extract text from documents.

extract [OPTIONS] FILEPATHS...

Arguments

  • FILEPATHS...: One or more paths to files or directories to process.

Options

  • -a, --auto: Automatic mode.
  • -t, --threshold: CV2 threshold value (0-255). Default is 128.
  • -p, --psm: Tesseract Page Segmentation Mode (0-13). Default is 11.
  • -o, --oem: Tesseract OCR Engine Mode (0-3). Default is 3.
  • -n, --number-only: Only extract numbers from the documents.
  • --debug: Enable debug mode.

Example

extract tests/images/ita/ci1.png

This will output the extracted text from the ci1.png image.

Development

To set up the development environment, you need to have tox installed.

Running Tests

To run the test suite, use the following command:

tox -e d52-py313

Linting

To check for code style and linting errors, run:

tox lint

Documentation

To build the documentation, use the following command:

tox docs

The documentation will be generated in the site directory.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributors 2

  •  
  •