techytushar/ocr-date-extractor

OCR Date Extractor

Flask API to extract dates from documents

How to use

The API exposes two routes:

  • To pass a Base64-encoded image, send a POST request with the payload {"base_64_image_content": <base_64_image_bytes>} to
https://ocr-date-extractor.herokuapp.com/extract_date
  • To pass an image file, send a POST request with the payload {'image': <image_file>} to
https://ocr-date-extractor.herokuapp.com/extract_date_from_image
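
To make the shape of the Base64 payload concrete, here is a minimal sketch of how the field could be built on the client and decoded again on the receiving side. The field name comes from the route description above; the helper names `build_payload` and `decode_payload` are hypothetical, and the decoding step is an assumption about what a server would do, not the project's actual code.

```python
import base64

# Field name taken from the route description; everything else is illustrative.
FIELD_NAME = "base_64_image_content"


def build_payload(image_bytes: bytes) -> dict:
    """Client side: wrap raw image bytes in the form payload."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {FIELD_NAME: encoded}


def decode_payload(payload: dict) -> bytes:
    """Server side (assumed): recover the raw image bytes from the payload."""
    return base64.b64decode(payload[FIELD_NAME], validate=True)


if __name__ == "__main__":
    sample = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG signature as stand-in data
    payload = build_payload(sample)
    print(payload)
    print(decode_payload(payload) == sample)
```

Passing `validate=True` makes `b64decode` raise on characters outside the Base64 alphabet instead of silently discarding them, which is usually the safer choice for data arriving over HTTP.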

Python sample code to test out the API:

  1. Sending the image as a Base64-encoded string
import base64
import requests

img_path = "<path_to_image>"  # replace with the path to your image
# Read the image and Base64-encode its bytes
with open(img_path, 'rb') as f:
    img = base64.b64encode(f.read())
# Send the encoded image as form data
response = requests.post('https://ocr-date-extractor.herokuapp.com/extract_date', data={'base_64_image_content': img})
print(response.content)
  2. Uploading the image file directly
import requests

url = "https://ocr-date-extractor.herokuapp.com/extract_date_from_image"
# Open the file in a context manager so it is closed after the upload
with open('/Users/tushar/peak/document.png', 'rb') as f:
    files = [('image', ('document.png', f, 'image/png'))]
    response = requests.post(url, files=files)
print(response.text)

Working

The project performs the following steps on each input image:

  • Rescales the image if it is too large
  • Applies thresholding to separate the foreground (the document) from the background
  • Finds contours and draws a bounding box around the document in the image
  • Crops the image to keep only the document
  • Applies thresholding again to separate the text from the background
  • Runs OCR to extract the text
  • Uses a regex to pull the date out of the text
  • Parses the date and returns it in YYYY-MM-DD format
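
The rescale, threshold, and crop steps above can be sketched in plain Python on a grayscale image stored as a list of rows. This is a simplified stand-in, not the project's code: the `max_side` and `cutoff` defaults are assumed values, every function name is hypothetical, and the bounding box is taken over all foreground pixels rather than from contour detection, which a real implementation would more likely get from an image library.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]          # grayscale image: rows of 0-255 intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def scaled_size(width: int, height: int, max_side: int = 1000) -> Tuple[int, int]:
    """Return dimensions shrunk so the longer side is at most max_side (assumed limit)."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def threshold(image: Image, cutoff: int = 128) -> List[List[bool]]:
    """Mark pixels brighter than cutoff as foreground (assumes a light document)."""
    return [[pixel > cutoff for pixel in row] for row in image]


def bounding_box(mask: List[List[bool]]) -> Optional[Box]:
    """Smallest box containing every foreground pixel, or None if there is none."""
    if not mask:
        return None
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        return None
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image: Image, box: Box) -> Image:
    """Keep only the pixels inside the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    # A dark background with a bright 2x3 "document" in the middle.
    image = [
        [10, 10, 10, 10, 10],
        [10, 200, 210, 220, 10],
        [10, 205, 215, 225, 10],
        [10, 10, 10, 10, 10],
    ]
    box = bounding_box(threshold(image))
    print(box)
    print(crop(image, box))
```

The second thresholding pass in the list would reuse `threshold` on the cropped image with the comparison inverted, so that dark text becomes the foreground before OCR.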

Supported Date Formats

The following date formats are supported, with some flexibility:

  • dd-mm-yyyy
  • mm-dd-yyyy
  • yyyy-mm-dd
  • dd/mm/yyyy
  • mm/dd/yyyy
  • yyyy/mm/dd
  • Aug23'19
  • Feb 24, 2019
  • 24 May'19
