Which format for my need? (Software applications)

Thread poster: Samuel Murray

Samuel Murray

Netherlands
Local time: 21:02
Member (2006)
English to Afrikaans

May 27, 2009

G'day everyone

I have several very old books that I'd like to have in electronic format. I have scanned several of them, but the OCR is only about 95% accurate. This is good enough for making the text searchable, but not good enough for general, overall reference purposes. In other words, it is a good enough accuracy for me to use Ctrl+F to search for something, and in many cases I should be able to deduce the correct spelling of a mis-OCR'ed word (through context), but I would ultimately still need to consult the image file from time to time.

What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content. In other words, if there were such a thing as hidden text in a PDF, then I would have each image as a PDF page and simply put the OCR'ed text of that page as hidden text, so that the page comes up in a search but I would still need to read the page like a paper page.

My current solution is to use a text-like format that allows hyperlinks in it, and then I simply ensure that every page contains a link to the relevant image file. Then I can use Ctrl+F to do a search, and when I find something, I can click the hyperlink, which launches the relevant image file in the default image viewer. This is a workable solution, though far from ideal.
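The workaround described above (searchable text with a link to each page image) could be generated automatically. The following is only a minimal sketch: it assumes the OCR output is already available as one string per page, and the function name `build_index` and the image file names are hypothetical, not part of any tool mentioned in this thread.

```python
import html


def build_index(pages):
    """Build one HTML document from (image_filename, ocr_text) pairs.

    Each page becomes a heading that links to the scanned image, followed
    by the OCR text, so Ctrl+F in a browser finds the text and the link
    opens the scan. The pairs are assumed to be in page order.
    """
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'><title>Book index</title></head><body>",
    ]
    for number, (image_name, text) in enumerate(pages, start=1):
        href = html.escape(image_name, quote=True)
        parts.append(f'<h2><a href="{href}">Page {number} (scan)</a></h2>')
        # Escape the OCR text so stray '<' or '&' characters cannot break the page.
        parts.append(f"<pre>{html.escape(text)}</pre>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Writing the returned string to a file next to the image files (for example with `pathlib.Path("index.html").write_text(...)`) gives a single document to search, with one click to the original page.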

Can what I've described above be done in PDF at all? Or is there another format that will allow this sort of thing?

Thanks!
Samuel

Andreas Nieckele

Brazil
Local time: 16:02
English to Portuguese

Maybe

May 27, 2009

You could try using InDesign or your favorite desktop publishing program and do the following:

- Create a new page, and place a text box containing the text for that page
- On top of this text box, place the image of the page so that it completely covers the text
- Generate a PDF

I've never tried to do this myself, but I guess in theory it should work.

Adam Łobatiuk

Poland
Local time: 21:02
Member (2009)
English to Polish

Adobe Acrobat (Professional) has an OCR feature

May 27, 2009

And it works surprisingly well. You see scanned pages, but the text in the PDF is searchable. See here: http://www.llrx.com/features/adobe8.htm

Narcis Lozano Drago

Spain
Local time: 21:02
Member (2007)
English to Spanish

PDF with transparent text

May 27, 2009

The OCR software I use has this option. You can load a PDF, perform the recognition, and then export it as a PDF file in which each page is saved as an image with the recognised text hidden behind it. You can easily search that text in Adobe Reader.

Unfortunately, this software (E.Typist, http://mediadrive.jp/products/et/ ) is in Japanese (recognition is multilingual, though). But I guess that there may be other OCR sof...
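To illustrate what "image as the page, hidden text for search" means at the file level, here is a rough sketch that writes a one-page PDF by hand using only the Python standard library. It relies on PDF text render mode 3 (`3 Tr`), which paints nothing but leaves the text in the page content. Everything else is an assumption made for illustration: the function names, the raw greyscale image input, the roughly A5 page size, and the fact that the text is simply written line by line rather than positioned over the matching words as real OCR software does. It is not how any of the products named in this thread work internally.

```python
def pdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def searchable_page_pdf(ocr_lines, gray_pixels, img_w, img_h, width=420, height=595):
    """Return the bytes of a one-page PDF: scanned image plus invisible OCR text.

    gray_pixels is raw 8-bit greyscale data, row by row (img_w * img_h bytes).
    The default page size is roughly A5 in PDF points (an assumption).
    """
    if len(gray_pixels) != img_w * img_h:
        raise ValueError("gray_pixels must contain img_w * img_h bytes")

    # Content stream: draw the image scaled to the page, then write the OCR
    # text in render mode 3, which is invisible but still part of the page text.
    ops = ["q", f"{width} 0 0 {height} 0 0 cm", "/Im1 Do", "Q",
           "BT", "3 Tr", "/F1 10 Tf", "12 TL", f"36 {height - 36} Td"]
    ops += [f"({pdf_escape(line)}) Tj T*" for line in ocr_lines]
    ops.append("ET")
    # Simplification: characters outside Latin-1 are replaced with '?'.
    content = "\n".join(ops).encode("latin-1", "replace")

    def stream(dictionary, data):
        head = b"<< %s /Length %d >>\nstream\n" % (dictionary, len(data))
        return head + bytes(data) + b"\nendstream"

    page = ("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] /Contents 4 0 R "
            "/Resources << /Font << /F1 5 0 R >> /XObject << /Im1 6 0 R >> >> >>"
            % (width, height)).encode("ascii")
    image_dict = (b"/Type /XObject /Subtype /Image /Width %d /Height %d "
                  b"/ColorSpace /DeviceGray /BitsPerComponent 8" % (img_w, img_h))
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        page,
        stream(b"", content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        stream(image_dict, gray_pixels),
    ]

    # Serialise the objects and record each byte offset for the xref table.
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)
```

In practice a dedicated OCR package is the sensible route; the sketch only shows that the format itself supports the combination asked about in the first post.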

Jing Nie
China
Local time: 03:02
Member (2011)
English to Chinese

I agree

May 27, 2009

Adam Łobatiuk wrote:

And it works surprisingly well. You see scanned pages, but the text in the PDF is searchable. See here: http://www.llrx.com/features/adobe8.htm

You can convert the scanned images into a PDF and then use the OCR function in Acrobat. That way the font and layout do not change, and all the text becomes searchable.

Erik Freitag

Germany
Local time: 21:02
Member (2006)
Dutch to German

ABBYY FineReader

May 28, 2009

ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.

Ahmet Murati

Germany
English to Albanian

So far the best tool for OCR is ABBYY FineReader

May 28, 2009

efreitag wrote:

ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.

But the scanned book's resolution should be more than 300 DPI, because a better image gives better OCR results.

Sergei Leshchinsky

Ukraine
Local time: 22:02
Member (2008)
English to Russian

make two files of each

May 28, 2009

Samuel Murray wrote:
What I want to know is if there is a format that will allow me to have image files as pages but OCR'ed text as searchable content.

1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.
2) Make a plain TXT for search.
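With the two-file approach, a search hit in the TXT file still has to be mapped back to a page in the image file. A minimal sketch of that lookup follows, assuming (this is not stated in the thread) that the OCR text of each page is separated by a form feed character in the TXT file; the function name `find_pages` is made up for illustration.

```python
def find_pages(book_text, query, page_break="\f"):
    """Return the 1-based page numbers whose OCR text contains the query.

    The match is case-insensitive. The form feed page separator is an
    assumption about how the TXT file was produced.
    """
    needle = query.casefold()
    return [
        number
        for number, page in enumerate(book_text.split(page_break), start=1)
        if needle in page.casefold()
    ]
```

For example, `find_pages("alpha\fbeta Gamma\fgamma ray", "gamma")` returns `[2, 3]`, which are the pages to open in the DJVU viewer.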

Samuel Murray

Netherlands
Local time: 21:02
Member (2006)
English to Afrikaans

TOPIC STARTER

Where can I find a DJVU generator?

May 28, 2009

Sergei Leshchinsky wrote:
1) Scan and save to DJVU to have the pages as images. DJVU compression is about 7 times as compact as PDF plus it preserves the paper colour, as it treats the image as several layers: paper, pictures, and text that are processed in different ways.

I'm well aware of DJVU but I have yet to find a DJVU generator that is not experimental. Do you know of stable, usable DJVU generators?

Samuel Murray

Netherlands
Local time: 21:02
Member (2006)
English to Afrikaans

TOPIC STARTER

Best resolution for OCR

May 28, 2009

Ahmet Murati wrote:

efreitag wrote:
ABBYY FineReader will do what you want and save the OCRed document as a searchable PDF. You can also copy text from the PDF.

But the scanned book's resolution should be more than 300 DPI, because a better image gives better OCR results.

I have found that 300 DPI is about the most economical resolution for OCR. Scanning at 450 DPI will result in an image roughly twice as large (2.25 times the pixels), and scanning it will also take about twice as long. Scanning at 600 DPI will result in an image four times as large, and scanning it will take about four times as long.
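The size figures follow from the fact that pixel count grows with the square of the resolution. A small sketch of the arithmetic, assuming an A5 page of 148 x 210 mm (the function names are only for illustration, and scan time is scanner-dependent, so only the pixel counts are computed here):

```python
def pixel_factor(base_dpi, dpi):
    """How many times more pixels a scan at dpi has than one at base_dpi."""
    return (dpi / base_dpi) ** 2


def page_pixels(width_mm, height_mm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_mm / 25.4 * dpi), round(height_mm / 25.4 * dpi)


# A5 is 148 x 210 mm; compare the three resolutions discussed above.
for dpi in (300, 450, 600):
    w, h = page_pixels(148, 210, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {pixel_factor(300, dpi):.2f}x the pixels of 300 DPI")
```

So 450 DPI gives 2.25 times the pixels of 300 DPI, and 600 DPI gives exactly four times. Uncompressed file size scales the same way; compressed file size will vary with the image content.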

My scanner takes about 30 seconds to scan an A5 page (using the document feeder) at 300 DPI.

My experience (anecdotal) is that I gain about one or two percent in accuracy with 450 DPI as opposed to 300 DPI, and only about another half a percent in accuracy with 600 DPI, so it really ain't worth scanning at resolutions higher than 300 DPI.

I also find that there is very little difference in accuracy whether I scan in full colour or in straight black and white, except if the printed page contains halftone backgrounds, in which case strangely the black and white scan OCRs better (I would have thought the other way round makes more sense). Even so, I find it best to scan in full colour and leave it up to the OCR program to posterise the image if it wants to.

I'm a little annoyed that my scanner has a white background and not black, for black would be somewhat easier to autocrop.

Anyway, I was aware of ABBYY's PDF creator but I did not realise that it can combine text with images. I'll experiment a bit.
