HOCR rendering compares unfavorably with tesseract PDF text layer

Using recode_pdf (internetarchivepdf 1.5.2) and tesseract (5.3.0). 

I have three single-page examples, where I:

1. Have tesseract make a full PDF from OCR, via e.g. `tesseract identifier.tiff identifier.tesseract -l eng pdf`
2. Have tesseract output HOCR, and feed the HOCR and TIFF to recode_pdf, via e.g.:
    * `tesseract identifier.tiff identifier.tesseract  -l eng hocr`
    * `recode_pdf --bg-downsample 3 --from-imagestack identifier.tiff --hocr-file  identifier.tesseract.hocr -o identifier.recode_pdf.pdf`
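
To make the comparison easier to repeat, the two pipelines above can be scripted. This is only a minimal sketch: the `build_commands` helper and its parameter names are hypothetical, and it does nothing more than assemble the command lines listed above for a given identifier and print them.

```python
import shlex


def build_commands(identifier, lang="eng", bg_downsample=3):
    """Assemble the command lines from this report for one page.

    `identifier` is the base name of the TIFF (e.g. "dhc6a4r").
    Nothing is executed here; the commands are only built.
    """
    tiff = f"{identifier}.tiff"
    base = f"{identifier}.tesseract"
    return {
        # Pipeline 1: tesseract writes the PDF itself.
        "tesseract_pdf": ["tesseract", tiff, base, "-l", lang, "pdf"],
        # Pipeline 2, step 1: tesseract writes HOCR only.
        "tesseract_hocr": ["tesseract", tiff, base, "-l", lang, "hocr"],
        # Pipeline 2, step 2: recode_pdf combines the TIFF and the HOCR.
        "recode_pdf": [
            "recode_pdf",
            "--bg-downsample", str(bg_downsample),
            "--from-imagestack", tiff,
            "--hocr-file", f"{base}.hocr",
            "-o", f"{identifier}.recode_pdf.pdf",
        ],
    }


if __name__ == "__main__":
    for name, cmd in build_commands("dhc6a4r").items():
        print(f"{name}: {shlex.join(cmd)}")
```

Passing each list to `subprocess.run(cmd, check=True)` would actually run the steps, assuming both tools are installed and on `PATH`.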

I am finding that the text layout produced by the second process, the one involving recode_pdf, is not identical to, and is inferior to, the text layout tesseract produces itself. I have put all my sample files on an S3 bucket for investigation, although I don't know if they will stay there forever.

## Simple page
A simple, clear textual book page.
* [original tiff](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/dhc6a4r.tiff)
* [tesseract-generated pdf](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/dhc6a4r.tesseract.pdf)
* [tesseract-generated hocr](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/dhc6a4r.tesseract.hocr)
* [recode_pdf-generated pdf](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/dhc6a4r.recode_pdf.pdf)

If you select text in the PDFs, the selection in the recode_pdf one is much shorter in height than in the tesseract one: the selection bar does not go all the way to the top of the ascenders like it does with the tesseract one.

In this case, the recode_pdf version is perfectly usable, but it demonstrates that the two outputs are not identical.


![Screen Shot 2023-03-28 at 2 14 31 PM](https://user-images.githubusercontent.com/149304/228330502-6497d3fa-f371-4666-9e78-df2e865e5396.png)
![Screen Shot 2023-03-28 at 2 14 40 PM](https://user-images.githubusercontent.com/149304/228330496-fb64b48c-66df-4f65-8fc7-2fe0cdb226f7.png)
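
One way to dig into the height difference is to look at the per-line metrics recorded in the HOCR file. Tesseract's HOCR output typically puts `bbox`, `baseline`, `x_size`, `x_ascenders` and `x_descenders` in the `title` attribute of each line element. The sketch below, using only the Python standard library, pulls those out so they can be compared against what each PDF shows. The function names and the sample markup are made up for illustration, and this says nothing about how recode_pdf or tesseract actually use these values.

```python
from html.parser import HTMLParser

# hOCR classes that tesseract uses for line-level elements.
LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}


def parse_title(title):
    """Split an hOCR title attribute into {property: [floats]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        try:
            props[tokens[0]] = [float(t) for t in tokens[1:]]
        except ValueError:
            # Skip properties with non-numeric values (e.g. image names).
            continue
    return props


class LineCollector(HTMLParser):
    """Collect the parsed title properties of every line element."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") in LINE_CLASSES and attrs.get("title"):
            self.lines.append(parse_title(attrs["title"]))


def first(props, key):
    """Return the first value of a property, or None if it is absent."""
    values = props.get(key)
    return values[0] if values else None


def line_metrics(hocr_text):
    """Return bbox height and font metrics for each line in an hOCR string."""
    collector = LineCollector()
    collector.feed(hocr_text)
    metrics = []
    for props in collector.lines:
        bbox = props.get("bbox", [])
        if len(bbox) != 4:
            continue
        x0, y0, x1, y1 = bbox
        metrics.append({
            "bbox_height": y1 - y0,
            "x_size": first(props, "x_size"),
            "x_ascenders": first(props, "x_ascenders"),
            "x_descenders": first(props, "x_descenders"),
        })
    return metrics


# Invented sample line, shaped like tesseract's hOCR output.
SAMPLE = (
    "<span class='ocr_line' id='line_1_1' title=\"bbox 36 92 618 184; "
    "baseline 0 -20; x_size 30; x_descenders 6; x_ascenders 6\">"
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 96 116; x_wconf 95'>Hello</span></span>"
)

if __name__ == "__main__":
    for row in line_metrics(SAMPLE):
        print(row)
```

Running `line_metrics(open("dhc6a4r.tesseract.hocr", encoding="utf-8").read())` on the file linked above would show whether the line heights in the HOCR look closer to the tesseract selection height or to the recode_pdf one.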

## Somewhat more complicated page

This one is also a book page, but has a figure in the middle of the page interrupting the text, some background coloration, and the photography was not perfectly squared so the text is somewhat diagonal.

* [original tiff](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/wg8ie02.tiff)
* [tesseract-generated pdf](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/wg8ie02.tesseract.pdf)
* [tesseract-generated hocr](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/wg8ie02.tesseract.hocr)
* [recode_pdf-generated pdf](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/wg8ie02.recode_pdf.pdf)

In this one, the recode_pdf-generated text all seems to be double height. This makes it very confusing to select text, and makes the highlights on search-within-the-PDF results very confusing too, which is a definite usability issue. I only have one line of text selected in these screenshots.

![Screen Shot 2023-03-28 at 2 19 35 PM](https://user-images.githubusercontent.com/149304/228331659-57a275db-a37a-4a9b-93ff-4ead80c2d8b0.png)
![Screen Shot 2023-03-28 at 2 19 15 PM](https://user-images.githubusercontent.com/149304/228331661-c08ef0b1-0899-47fb-9386-559e266e2241.png)
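
One hypothesis that might be worth checking here, and it is only a guess, is that the skew is involved. In HOCR, a line's `bbox` is axis-aligned, while the `baseline` property gives a slope and an offset relative to the bottom-left corner of that box. For slanted text the box has to cover the whole rise of the line, so it is taller than the glyphs themselves. A renderer that sized text from the box height alone, ignoring the slope, would then produce text that is too tall. The sketch below estimates that effect; the function name and the numbers are invented for illustration and are not taken from the sample file or from either tool's code.

```python
def skew_inflation(bbox, baseline):
    """Estimate how much of a line's bbox height is due to skew.

    bbox:     (x0, y0, x1, y1) from the hOCR `bbox` property.
    baseline: (slope, offset) from the hOCR `baseline` property.
    """
    x0, y0, x1, y1 = bbox
    slope, _offset = baseline
    width = x1 - x0
    height = y1 - y0
    # Vertical distance the baseline travels across the line.
    rise = abs(slope) * width
    return {
        "bbox_height": height,
        "rise": rise,
        # Rough text height once the rise is subtracted out.
        "deskewed_height": max(height - rise, 0.0),
    }


if __name__ == "__main__":
    # Invented values: a 1000 px wide line with a slope of 0.04.
    print(skew_inflation((100, 200, 1100, 280), (0.04, -10)))
```

With these made-up numbers the box is about twice as tall as the deskewed text. Comparing the real `bbox` and `baseline` values in `wg8ie02.tesseract.hocr` against the selection heights in both PDFs would show whether this explanation holds up.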

## More complex graphical page

This is a graphical advertisement that only has a little bit of text on it, at various places and in various fonts. 

* [original tiff](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/2y60cl2.tiff)
* [tesseract-generated pdf](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/2y60cl2.tesseract.pdf)
* [tesseract-generated HOCR](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/2y60cl2.tesseract.hocr)
* [recode_pdf-generated PDF](https://scihist-digicoll-staging-public.s3.amazonaws.com/archive-pdf-tools-exhibit/2y60cl2.recode_pdf.pdf)


This one is harder to explain and demonstrate, and for me the problem only reproduces in MacOS Preview.

If I open the recode_pdf PDF in MacOS Preview (with "Live Text" **disabled**, yup), and try to drag to select the line at "effective residual deposit", I can't select the whole line. The layout of the text seems to lead the PDF reader to think there is a column there or something.

This does not reproduce in the Chrome PDF viewer, where selection works okay. It does reproduce in MacOS Preview, which is where these screen recordings are from. (I disable "Live Text" in my MacOS settings to ensure that what I'm seeing in Preview is only the text data embedded in the PDF, not OCR that MacOS Preview does itself on the fly under the branding "Live Text"!) I realize text order in PDFs is a heuristic applied by the viewer, but this demonstrates that something in the layout is different: the layout from tesseract led to a successful heuristic in MacOS Preview, and the one from recode_pdf did not.

![Screen Recording 2023-03-28 at 2 31 42 PM](https://user-images.githubusercontent.com/149304/228335362-f4f57b6a-0cc0-4eac-8479-9b07def5c650.gif)
![Screen Recording 2023-03-28 at 2 31 08 PM](https://user-images.githubusercontent.com/149304/228335361-8b5f326d-6107-4d7c-b607-15efbf5dcffb.gif)


## What's going on?

I know you originally ported the HOCR rendering from tesseract.  Brainstorming....  

* could there be bug(s) in the port?
* could tesseract have changed its implementation after the port?
* could tesseract have changed what it outputs as HOCR? And/or is what tesseract outputs as HOCR actually not what it uses internally to position text when going directly from a tesseract OCR operation to PDF? Is the HOCR somehow missing info that tesseract has from its own OCR and uses to position the text better?

