Which format for my need?
Thread poster: Samuel Murray
Samuel Murray, Netherlands, Member (2006), English to Afrikaans + ...
G'day everyone
I have several very old books that I'd like to have in electronic format. I have scanned several of them, but the OCR is only about 95% accurate. This is good enough for making the text searchable, but not good enough for general, overall reference purposes. In other words, it is a good enough accuracy for me to use Ctrl+F to search for something, and in many cases I should be able to deduce the correct spelling of a mis-OCR'ed word (through context), but I would ultimately still need to consult the image file from time to time.
What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content. In other words, if there were such a thing as hidden text in a PDF, then I would have each image as a PDF page and simply put the OCR'ed text of that page as hidden text, so that the page comes up in a search but I would still need to read the page like a paper page.
My current solution is to use a text-like format that allows hyperlinks in it, and then I simply ensure that every page contains a link to the relevant image file. Then I can use Ctrl+F to do a search, and when I find something, I can click the hyperlink, which launches the relevant image file in the default image viewer. This is a workable solution, though far from ideal.
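The hyperlink approach described above could be sketched as follows, assuming the OCR output is available as one string per page and each scan is a separate image file. The function name `build_index` and the file names are hypothetical, and plain HTML is only one possible text-like format that allows hyperlinks.

```python
from html import escape


def build_index(pages):
    """Build one searchable HTML document from (image_file, ocr_text) pairs.

    Each page gets a heading, a link to its scanned image, and the OCR text,
    so a Ctrl+F hit sits right next to the link that opens the scan.
    """
    parts = ["<html><body>"]
    for number, (image_file, ocr_text) in enumerate(pages, start=1):
        parts.append(f"<h2>Page {number}</h2>")
        # Link to the scan; the browser or default viewer opens the image.
        parts.append(f'<p><a href="{escape(image_file, quote=True)}">View scan</a></p>')
        # Escape the OCR text so stray '<' or '&' characters do not break the page.
        parts.append(f"<pre>{escape(ocr_text)}</pre>")
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical input: two scanned pages and their OCR text.
    demo = [("page001.png", "First page text"), ("page002.png", "Second page text")]
    print(build_index(demo))
```

Saving the result as a single file keeps the whole book searchable in one place, while the image files stay untouched next to it.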
Can what I've described above with PDF be done in PDF at all? Or is there another format that will allow this sort of thing?
Thanks!
Samuel
You could try using InDesign or your favorite desktop publishing program and do the following:
- Create a new page, and place a text box containing the text for that page
- On top of this text box, place the image from the page so that it completely blocks the text
- Generate a pdf
I've never tried to do this myself, but I guess in theory it should work.
Adam Łobatiuk, Poland, Member (2009), English to Polish + ...
Adobe Acrobat (Professional) has an OCR feature | May 27, 2009
And it works surprisingly well. You see scanned pages, but the text in the PDF is searchable. See here: http://www.llrx.com/features/adobe8.htm
PDF with transparent text | May 27, 2009
The OCR software I use has this option. You can load a PDF, perform the recognition and then export it as a PDF file, with the page saved as an image, with the text hidden, which you can easily search in Adobe Reader.
Unfortunately, this software (E.Typist, http://mediadrive.jp/products/et/ ) is in Japanese (recognition is multilingual, though). But I guess that there may be other OCR software out there with the same functionality.
Narcis
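For readers curious what "hidden text" means inside the file: PDF defines text rendering mode 3 (neither fill nor stroke), which makes text invisible while leaving it searchable. The following is a minimal sketch using only the Python standard library, not a description of how E.Typist or any other product works. The function name `build_hidden_text_pdf`, the page size (roughly A5 in points) and the font are assumptions, and the scanned image is omitted; a real file would also draw the page image in the same content stream.

```python
def build_hidden_text_pdf(text):
    """Return a one-page PDF whose text is drawn in invisible render mode.

    Assumes a single line of ASCII text. The scanned page image is left out.
    """
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # '3 Tr' selects text rendering mode 3: the glyphs are neither filled nor stroked.
    content = f"BT /F1 12 Tf 3 Tr 36 550 Td ({escaped}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 420 595] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    # Hypothetical output file name.
    with open("hidden_text_demo.pdf", "wb") as handle:
        handle.write(build_hidden_text_pdf("OCR text for this page"))
```

The resulting page should look blank in a viewer, since nothing visible is drawn, while a text search for the OCR string should still find it.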
Jing Nie, China, Member (2011), English to Chinese + ...
You may convert the scanned images into a PDF and then use the OCR function in Acrobat. This will not change the font or layout, and all text will be searchable.
Erik Freitag, Germany, Member (2006), Dutch to German + ...
ABBYY FineReader | May 28, 2009
ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.
So far the best tool for OCR is Abby Fineread | May 28, 2009
efreitag wrote:
ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.
But the resolution of the scanned book should be more than 300 pixels, because a better image gives better OCR results.
make two files of each | May 28, 2009
Samuel Murray wrote:
What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content.
1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.
2) Make a plain TXT for search.
Samuel Murray (TOPIC STARTER), Netherlands, Member (2006), English to Afrikaans + ...
Where can I find a DJVU generator? | May 28, 2009
Sergei Leshchinsky wrote:
1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.
I'm well aware of DJVU but I have yet to find a DJVU generator that is not experimental. Do you know of stable, usable DJVU generators?
Samuel Murray (TOPIC STARTER), Netherlands, Member (2006), English to Afrikaans + ...
Best resolution for OCR | May 28, 2009
Ahmet Murati wrote:
efreitag wrote:
ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.
But the resolution of the scanned book should be more than 300 pixels, because a better image gives better OCR results.
I have found that 300 DPI is about the most economical resolution for OCR. Scanning at 450 DPI will result in an image twice as large, and scanning it will also take twice as long. Scanning at 600 DPI will result in an image four times as large and scanning it will take four times as long.
My scanner takes about 30 seconds to scan an A5 page (using the document feeder) at 300 DPI.
My experience (anecdotal) is that I gain about one or two percent in accuracy with 450 DPI as opposed to 300 DPI, and only about another half a percent in accuracy with 600 DPI, so it really ain't worth scanning at resolutions higher than 300 DPI.
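The size figures above follow from simple arithmetic: the number of pixels grows with the square of the resolution, so 450 DPI gives 2.25 times the pixels of 300 DPI (roughly double) and 600 DPI gives four times. A small sketch, assuming an A5 page of 148 x 210 mm and an uncompressed image size proportional to the pixel count. The function names are illustrative, and scan time depends on the scanner, so it is not modelled here.

```python
MM_PER_INCH = 25.4


def pixel_size(width_mm, height_mm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def relative_pixel_count(dpi, base_dpi=300):
    """How many times more pixels a scan has compared with the base resolution."""
    # Both width and height scale linearly with DPI, so the area scales quadratically.
    return (dpi / base_dpi) ** 2


if __name__ == "__main__":
    for dpi in (300, 450, 600):
        width, height = pixel_size(148, 210, dpi)  # A5 page, assumed dimensions
        factor = relative_pixel_count(dpi)
        print(f"{dpi} DPI: {width} x {height} px, {factor:.2f}x the pixels of 300 DPI")
```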
I also find that there is very little difference in accuracy whether I scan in full colour or in straight black and white, except if the printed page contains halftone backgrounds, in which case strangely the black and white scan OCRs better (I would have thought the other way round makes more sense). Even so, I find it best to scan in full colour and leave it up to the OCR program to posterise the image if it wants to.
I'm a little annoyed that my scanner has a white background and not black, for black would be somewhat easier to autocrop.
Anyway, I was aware of ABBYY's PDF creator but I did not realise that it can combine text with images. I'll experiment a bit.