Description
Using recode_pdf (internetarchivepdf 1.5.2) and tesseract (5.3.0).
I have three single-page examples, where I:
- Have tesseract make a full PDF from OCR, via e.g.:
    tesseract identifier.tiff identifier.tesseract -l eng pdf
- Have tesseract output HOCR, and feed the HOCR and TIFF to recode_pdf, via e.g.:
    tesseract identifier.tiff identifier.tesseract -l eng hocr
    recode_pdf --bg-downsample 3 --from-imagestack identifier.tiff --hocr-file identifier.tesseract.hocr -o identifier.recode_pdf.pdf
I am finding that the text layout produced by the second process, the one involving recode_pdf, is not identical to, and is inferior to, the text layout tesseract produces itself. I have put all my sample files on an S3 bucket for investigation, although I don't know if they will stay there forever.
Simple page
A simple, clear textual book page
If you select text in the PDF, the selection in the recode_pdf one is much shorter than in the tesseract one: the selection bar does not go all the way to the top of the ascenders like it does in the tesseract one.
In this case, the recode_pdf version is perfectly usable, but it demonstrates that the two outputs are not identical.
Somewhat more complicated page
This one is also a book page, but has a figure in the middle of the page interrupting the text, some background coloration, and the photography was not perfectly squared, so the text is somewhat diagonal.
In this one, the recode_pdf-generated text all seems to be double-height, which makes selecting text very confusing and makes the highlights on search-within-the-PDF results very confusing too: a definite usability issue. I only have one line of text selected in these screenshots.
More complex graphical page
This is a graphical advertisement that only has a little bit of text on it, at various places and in various fonts.
This one is harder to explain and demonstrate, and for me the problem only reproduces in macOS Preview.
If I open the recode_pdf PDF in macOS Preview (with "Live Text" disabled, yup), and try to drag to select the line at "effective residual deposit", I can't select the whole line -- the layout of the text is leading the PDF reader to think there is a column there or something.
This one does not reproduce in the Chrome PDF viewer; selection works okay there. But it reproduces in macOS Preview, and these screen recordings are from there. (I disable "Live Text" in my macOS settings to ensure that what I'm seeing in Preview is embedded text data from the PDF only, not OCR that macOS Preview does itself on-the-fly under the branding "Live Text"!) I realize text order in PDFs is a heuristic applied by the viewer, but this demonstrates that something in the layout was different -- the layout from tesseract led to a successful heuristic in macOS Preview, and the one from recode_pdf did not.
What's going on?
I know you originally ported the HOCR rendering from tesseract. Brainstorming....
- could there be bug(s) in the port?
- could tesseract have changed its implementation after the port?
- could tesseract have changed what it outputs as HOCR? And/or is what tesseract outputs as HOCR actually not what it uses internally to position text directly from a tesseract OCR operation into PDF? Is the HOCR somehow missing info that tesseract has from its own OCR that it uses to position the text better?
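One way to start checking the last question is to look at what line geometry the HOCR file actually carries. Below is a minimal Python sketch, not anything taken from recode_pdf or tesseract: it reads the `title` attribute of each line-level element and reports the bounding-box height next to the `x_size` and `baseline` values, when present. The helper names (`LineTitleCollector`, `parse_title`, `line_metrics`) and the `skew_rise` figure (absolute baseline slope times line width) are my own assumptions for illustration. If the box height is much larger than `x_size` on the skewed page, that would be one possible lead on the double-height selections, though I have not confirmed it.

```python
import sys
from html.parser import HTMLParser


class LineTitleCollector(HTMLParser):
    """Collect the title attribute of every line-level hOCR element."""

    # Line-level classes; assumed to cover what tesseract emits.
    LINE_CLASSES = {"ocr_line", "ocr_header", "ocr_caption", "ocr_textfloat"}

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if classes & self.LINE_CLASSES and attr.get("title"):
            self.titles.append(attr["title"])


def parse_title(title):
    """Turn 'bbox 1 2 3 4; baseline 0.01 -5; x_size 30' into {name: [floats]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if len(tokens) < 2:
            continue
        try:
            props[tokens[0]] = [float(t) for t in tokens[1:]]
        except ValueError:
            pass  # skip non-numeric properties
    return props


def line_metrics(hocr_text):
    """Return one dict of geometry figures per line found in the hOCR text."""
    collector = LineTitleCollector()
    collector.feed(hocr_text)
    rows = []
    for title in collector.titles:
        props = parse_title(title)
        bbox = props.get("bbox")
        if not bbox or len(bbox) != 4:
            continue
        x0, y0, x1, y1 = bbox
        baseline = props.get("baseline")
        slope = baseline[0] if baseline else None
        rows.append({
            "bbox_height": y1 - y0,
            "x_size": props.get("x_size", [None])[0],
            "baseline_slope": slope,
            # Vertical distance the baseline travels across the line width.
            "skew_rise": abs(slope) * (x1 - x0) if slope is not None else None,
        })
    return rows


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        for row in line_metrics(handle.read()):
            print(row)
```

Saved under a name of your choosing and run with `identifier.tesseract.hocr` as its only argument, it prints one row per line of text, which can then be compared by eye against the selection heights in the two PDFs.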





