Bilingual Text Encoding is not Working for Kannada-English Output Hocr File #176

@vaibhavsanil

I am facing issues with hocr-pdf conversion: in a bilingual English-Kannada document, the Kannada text is not encoded correctly in the text layer of the output PDF file.

I have an image in the Kannada language below:
(https://drive.google.com/file/d/11P2XMFWjmc0S6rzfOX58UtZZJkG2StNI/view?usp=sharing)

The following is the corresponding hOCR output for the file:
https://drive.google.com/file/d/1wm-40rCN_rSE4cqT499kZAjAs5y6A3xl/view?usp=sharing

The following is the output of the GCV OCR for this file, in JSON:
OCR Output in JSON

The output of the hocr-pdf conversion is as follows:
Hocr-PDF output

As you can see, searching for English words in the output file highlights them correctly, but searching for Kannada text gives gibberish results in the file generated by the hocr-pdf conversion.
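
To help narrow down where the text gets damaged, here is a minimal sketch of a check on the hOCR side. It is only an illustration, not part of hocr-tools: it assumes the hOCR file is UTF-8 encoded, and the helper name `kannada_runs` is hypothetical. It strips the markup and lists the runs of characters that fall in the Unicode Kannada block (U+0C80 to U+0CFF). If the Kannada words show up intact here, the hOCR input is probably fine and the problem more likely lies in the PDF text layer.

```python
import re
from html import unescape
from typing import List

# Unicode Kannada block: U+0C80 to U+0CFF.
KANNADA_RUN = re.compile(r"[\u0C80-\u0CFF]+")
# Crude tag stripper; good enough for a quick sanity check of hOCR markup.
TAG = re.compile(r"<[^>]+>")


def kannada_runs(hocr_markup: str) -> List[str]:
    """Return the runs of Kannada characters found in the text of hOCR markup.

    Hypothetical helper for illustration only.
    """
    # Replace tags with spaces so adjacent words do not merge,
    # then decode character references such as &#3221;.
    text = unescape(TAG.sub(" ", hocr_markup))
    return KANNADA_RUN.findall(text)
```

To try it on a file, read the file with `open(path, encoding="utf-8")` and pass the contents to `kannada_runs`; an empty result or visibly broken words would point to the hOCR file rather than to the PDF conversion.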

Any guidance in this regard is appreciated.
