Which format for my need?
Thread poster: Samuel Murray
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
May 27, 2009

G'day everyone

I have several very old books that I'd like to have in electronic format. I have scanned several of them, but the OCR is only about 95% accurate. This is good enough for making the text searchable, but not good enough for general, overall reference purposes. In other words, it is a good enough accuracy for me to use Ctrl+F to search for something, and in many cases I should be able to deduce the correct spelling of a mis-OCR'ed word (through context), but I would ultimately still need to consult the image file from time to time.

What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content. In other words, if there were such a thing as hidden text in a PDF, then I would have each image as a PDF page and simply put the OCR'ed text of that page as hidden text, so that the page comes up in a search but I would still need to read the page like a paper page.

My current solution is to use a text-like format that allows hyperlinks in it, and then I simply ensure that every page contains a link to the relevant image file. Then I can use Ctrl+F to do a search, and when I find something, I can click the hyperlink, which launches the relevant image file in the default image viewer. This is a workable solution, though far from ideal.
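The workaround described above can be sketched as a small script that writes each page's OCR text into one HTML file, followed by a link to the matching scan. This is only an illustration: the function name `build_index`, the file names, and the sample text are assumptions, not something taken from an existing tool.

```python
from html import escape


def build_index(pages):
    """Build one HTML document from (image_name, ocr_text) pairs.

    Each page's OCR text is followed by a link to its scanned image,
    so a Ctrl+F hit in the browser is always next to the link.
    """
    parts = ["<html><body>"]
    for number, (image_name, ocr_text) in enumerate(pages, start=1):
        parts.append(f'<h2 id="page-{number}">Page {number}</h2>')
        # escape() keeps stray < > & from the OCR output out of the markup
        parts.append(f"<pre>{escape(ocr_text)}</pre>")
        parts.append(
            f'<p><a href="{escape(image_name)}">View scan of page {number}</a></p>'
        )
    parts.append("</body></html>")
    return "\n".join(parts)


# Hypothetical sample data: file names and OCR output are made up.
sample_pages = [
    ("page_001.png", "Tbe quick brown fox"),
    ("page_002.png", "jumps over <the> lazy dog"),
]
index_html = build_index(sample_pages)
print(index_html)
```

Saving the returned string as, say, `index.html` next to the image files gives a single searchable file in which every hit sits beside a link to the scan.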

Can what I've described above with PDF be done in PDF at all? Or is there another format that will allow this sort of thing?

Thanks!
Samuel


 
Andreas Nieckele
Brazil
English to Portuguese
Maybe May 27, 2009

You could try using InDesign or your favorite desktop publishing program and do the following:

- Create a new page, and place a text box containing the text for that page
- On top of this text box, place the image from the page so that it completely blocks the text
- Generate a pdf

I've never tried to do this myself, but I guess in theory it should work.


 
Adam Łobatiuk
Poland
Member (2009)
English to Polish
Adobe Acrobat (Professional) has an OCR feature May 27, 2009

And it works surprisingly well. You see scanned pages, but the text in the PDF is searchable. See here: http://www.llrx.com/features/adobe8.htm

 
Narcis Lozano Drago
Spain
Member (2007)
English to Spanish
PDF with transparent text May 27, 2009

The OCR software I use has this option. You can load a PDF, perform the recognition and then export it as a PDF file, with the page saved as an image, with the text hidden, which you can easily search in Adobe Reader.

Unfortunately, this software (E.Typist, http://mediadrive.jp/products/et/ ) is in Japanese (recognition is multilingual, though). But I guess that there may be other OCR software out there with the same functionality.


Narcis
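For readers wondering how "image plus hidden text" can work inside a PDF: the format has a text rendering mode 3 (the `3 Tr` operator) that draws text invisibly while leaving it in the page content. The sketch below hand-writes a one-page PDF that way using only the Python standard library. It is a simplified illustration, not how E.Typist or any other product does it: the function names are made up, the scan is assumed to be raw 8-bit grayscale, the text must fit in Latin-1, and the whole OCR text is placed as a single line rather than word by word under the matching part of the image.

```python
def pdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def make_searchable_page(gray_pixels, width, height, ocr_text):
    """Return the bytes of a one-page PDF: a grayscale image plus invisible text.

    gray_pixels holds width * height raw 8-bit samples, top row first.
    The page is sized at one point per pixel, an arbitrary choice.
    """
    content = (
        # Scale the unit square to the page and paint the scanned image.
        f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q\n"
        # "3 Tr" selects text rendering mode 3: neither filled nor stroked.
        f"BT /F1 10 Tf 3 Tr 2 {height - 12} Td ({pdf_escape(ocr_text)}) Tj ET"
    ).encode("latin-1")
    image_obj = (
        f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
        f"/ColorSpace /DeviceGray /BitsPerComponent 8 /Length {len(gray_pixels)} >>\n"
    ).encode("ascii") + b"stream\n" + gray_pixels + b"\nendstream"
    page_obj = (
        f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
        f"/Resources << /XObject << /Im0 4 0 R >> /Font << /F1 5 0 R >> >> "
        f"/Contents 6 0 R >>"
    ).encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        page_obj,
        image_obj,
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        f"<< /Length {len(content)} >>\n".encode("ascii")
        + b"stream\n" + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += f"{number} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += (
        f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
        f"startxref\n{xref_pos}\n%%EOF\n"
    ).encode("ascii")
    return bytes(out)


# Hypothetical 200x100 mid-gray "scan" standing in for a real page image.
pdf_bytes = make_searchable_page(bytes([128]) * (200 * 100), 200, 100, "OCR text (page 1)")
```

Writing `pdf_bytes` to a file should give a page that shows only the gray image while a viewer's search can still find the text. OCR programs that export searchable PDFs do considerably more work, such as compressing the image and positioning each recognised word over its place in the scan.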


 
Jing Nie
China
Member (2011)
English to Chinese
I agree May 27, 2009

Adam Łobatiuk wrote:

And it works surprisingly well. You see scanned pages, but the text in the PDF is searchable. See here: http://www.llrx.com/features/adobe8.htm


You can convert the scanned images into a PDF and then use the OCR function in Acrobat. That way the font and layout are not changed, and all the text will be searchable.


 
Erik Freitag
Germany
Member (2006)
Dutch to German
ABBYY FineReader May 28, 2009

ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.

 
Ahmet Murati
Germany
English to Albanian
So far the best tool for OCR is ABBYY FineReader May 28, 2009

efreitag wrote:

ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.


But the resolution of the scanned book should be more than 300 DPI, so that the image is good enough to be OCRed.


 
Sergei Leshchinsky
Ukraine
Member (2008)
English to Russian
make two files of each May 28, 2009

Samuel Murray wrote:
What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content.


1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.
2) Make a plain TXT for search.
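With the two-file approach, the plain TXT does the searching and the DJVU is only opened at the page that matched. A minimal sketch of the search step follows. It assumes the OCR text is already available as one string per page (how the TXT is split into pages is left open), and the function name and sample text are hypothetical.

```python
def find_pages(page_texts, term):
    """Return the 1-based numbers of the pages whose OCR text contains term."""
    needle = term.casefold()  # case-insensitive comparison
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


# Hypothetical OCR output, one string per scanned page.
page_texts = [
    "Chapter one. The farm lay north of the river.",
    "The river flooded in the spring.",
    "Chapter two. Harvest.",
]
hits = find_pages(page_texts, "River")
print(hits)  # [1, 2]
```

The returned numbers are the pages to look up in the image file.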


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
TOPIC STARTER
Where can I find a DJVU generator? May 28, 2009

Sergei Leshchinsky wrote:
1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.


I'm well aware of DJVU but I have yet to find a DJVU generator that is not experimental. Do you know of stable, usable DJVU generators?


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
TOPIC STARTER
Best resolution for OCR May 28, 2009

Ahmet Murati wrote:
efreitag wrote:
ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.

But the resolution of the scanned book should be more than 300 DPI, so that the image is good enough to be OCRed.


I have found that 300 DPI is about the most economical resolution for OCR. Scanning at 450 DPI will result in an image about twice as large (2.25 times as many pixels), and scanning it will also take about twice as long. Scanning at 600 DPI will result in an image four times as large and scanning it will take four times as long.
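The size figures follow from the pixel count growing with the square of the resolution. A short sketch of that arithmetic for an A5 page is below. The helper names are made up, and the script only counts pixels: it says nothing about compressed file size or about how long a particular scanner takes.

```python
def pixel_dimensions(width_mm, height_mm, dpi):
    """Return the (width, height) of a scan in pixels."""
    mm_per_inch = 25.4
    return (
        round(width_mm / mm_per_inch * dpi),
        round(height_mm / mm_per_inch * dpi),
    )


def relative_size(dpi, base_dpi=300):
    """Return how many times more pixels a scan at dpi has than one at base_dpi."""
    return (dpi / base_dpi) ** 2


for dpi in (300, 450, 600):
    width_px, height_px = pixel_dimensions(148, 210, dpi)  # A5 is 148 mm x 210 mm
    print(dpi, width_px, height_px, relative_size(dpi))
```

At 450 DPI the factor is 2.25 and at 600 DPI it is 4.0, relative to 300 DPI.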

My scanner takes about 30 seconds to scan an A5 page (using the document feeder) at 300 DPI.

My experience (anecdotal) is that I gain about one or two percent in accuracy with 450 DPI as opposed to 300 DPI, and only about another half a percent in accuracy with 600 DPI, so it really ain't worth scanning at resolutions higher than 300 DPI.

I also find that there is very little difference in accuracy whether I scan in full colour or in straight black and white, except if the printed page contains halftone backgrounds, in which case strangely the black and white scan OCRs better (I would have thought the other way round makes more sense). Even so, I find it best to scan in full colour and leave it up to the OCR program to posterise the image if it wants to.

I'm a little annoyed that my scanner has a white background and not black, for black would be somewhat easier to autocrop.

Anyway, I was aware of ABBYY's PDF creator but I did not realise that it can combine text with images. I'll experiment a bit.


 

