Scanning and OCR

Content of the lesson:

Document Scanning
Scanning into PDF
Adobe Acrobat and OCR
Text Extraction
Individual Task

Operating the Scanner

This part shows how to operate a scanner using one particular model; other scanners work in a similar way. A scanner is usually connected to the computer with a USB cable.

Open the cover and place the document to be scanned inside the scanner (see the following image). The side facing the glass will be scanned (as with a photocopier). In principle, a scanner "takes a photo" of the document and saves it (usually as an image) on the hard drive of the computer.

The whole process is simple: once scanning starts, the scanning head under the glass passes along the whole document at a speed that depends on the selected quality and creates a file containing everything that was captured, that is, a copy of the document. Place the document in the scanner as straight as possible to avoid problems with the final image (such problems can be fixed afterwards in a graphics program).

The simplest case is scanning an A4 sheet: you only have to put it inside the scanner and close the cover. When scanning a book or a magazine you do not have to close the cover (you can press the book as close to the glass as possible). The closer the document lies to the glass, the better the final image.
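
The link between the selected quality and the size of the resulting image can be worked out with simple arithmetic. The following minimal sketch assumes that quality is set as a resolution in dots per inch (dpi), which this lesson does not state explicitly; the function name scan_size_px and the sample resolutions are illustrative only.

```python
MM_PER_INCH = 25.4

def scan_size_px(width_mm, height_mm, dpi):
    """Return (width, height) in pixels of a scan at the given resolution."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))

# An A4 sheet measures 210 x 297 mm.
A4_MM = (210, 297)

for dpi in (150, 300, 600):
    w, h = scan_size_px(*A4_MM, dpi)
    print(f"A4 at {dpi} dpi: {w} x {h} pixels")
```

Doubling the resolution doubles both dimensions, so the scanner has to capture four times as many pixels, which is one reason why a higher quality setting makes the scan slower and the file larger.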

Rewriting a Paper Document into a Text Processor (for example Word)

If you want to transfer text from a book into a text processor (for example InDesign or Word), you can automate the procedure with a computer. You have to scan the text (obtain an image of the text) and then let the computer convert the characters in the image into a new document. Using the OCR function (Optical Character Recognition), the computer can (more or less) recognize the text, find letters and other characters in the image, and insert them into any text processor.
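
To give a rough idea of what "recognizing" a character means, the following toy sketch compares a small scanned bitmap with stored templates and picks the closest one. This is only an illustration under strong assumptions: the 3x5 bitmaps, the names TEMPLATES, distance and recognize are all hypothetical, and real OCR software such as Adobe Acrobat uses far more sophisticated methods.

```python
# Hypothetical 3x5 bitmaps for three characters; '#' is ink, '.' is paper.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def distance(a, b):
    """Count the pixels in which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognize(glyph):
    """Return the template character that differs least from the glyph."""
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))

# A scanned "T" with one speck of dirt next to its stem.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
print(recognize(noisy_t))
```

The speck changes only one pixel, so the glyph is still closest to the "T" template. If the dirt fell in a less fortunate place, the glyph could become just as close to "I", which is how similar letters end up confused.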

The following procedure describes how to scan documents and how to use the OCR method; you can download the scanned documents for the following steps here: kniha-sken-1.pdf, kniha-sken-2.pdf

Procedure

Insert the required document into the scanner.
Press the PDF button on the scanner (or use the software for your scanner).
The lamp inside your scanner should light up and the scanner software on your computer should launch. The lamp now starts warming up, which takes several minutes (you can see a small window with the text Scanning on the display ... and with the text "Adjusting the lamp. Please keep the document cover closed."). Wait until this process is done (if you are able to close the cover of your scanner, do so; this depends on the size of your document).
After scanning, a new PDF file is created and saved under the defined name (the File Name option) on the hard drive of your computer (the sample document was saved to the folder C:\Users\student\Pictures). An Explorer window appears when the process is done, so you do not have to search for your file (see the following image).
Double-click the file to open it in Adobe Acrobat (note that you need Adobe Acrobat, not the common Adobe Reader); this program allows you to convert your PDF document to text. If you scanned your document upside down, you can rotate it to the required orientation (from the menu Document > Rotate pages).
In the Rotate Pages window you can set the direction of rotation and the range of pages which should be rotated. Click OK to confirm.
After this step you should have your scanned file opened and properly rotated (see the following image).
Now you can crop the document to the required text if needed (the scan may contain parts of the adjacent pages, edges of the book and other useless elements). You want a rectangular area that contains the required text only, nothing else. Use the Crop tool for this purpose (see the icon in the following image).

If you do not see this icon in the panel (the panel is not open), right-click the top panel and select the command More Adjustments.
A new panel with tools for adjustments should then appear. It usually opens as a floating panel (you can move it by dragging it with the mouse).
You can leave it floating and move it according to your needs, or you can place it next to the other tools in the tools panel.
Once placed there, the panel fits neatly inside the tools panel and you can use it comfortably.
Now select the Crop tool by clicking its icon and draw a rectangle in the document with your mouse. The rectangle should enclose the part of the document whose text will be extracted. Click anywhere in the document (this sets the first corner of the rectangle) and, holding the left mouse button down, drag out the rectangle. Then release the mouse button; you can still adjust the size of the rectangle by dragging its edges.
Once you are satisfied with your crop rectangle, press the ENTER key (if you do not know how to use a tool, hold your mouse cursor over it and help for that tool appears after a while). After you press ENTER, the following window should appear. The Page Range option is essential because it lets you apply the same crop to any number of pages, or to one page only. Cropping is finished after you press the OK button.
Your document is now ready for the text recognition process. Select the command Recognize text OCR > Recognize text using OCR...
A new window with the recognition settings will appear. Select the pages in which text should be recognized and then check the settings. The primary language option is very important (you can change the language by clicking the Edit button). Clicking the OK button starts the text recognition in a new window.
Adobe Acrobat will try to recognize the letters in the image (OCR). This takes some time, depending on the size of the image, the amount of text and the performance of your computer. Afterwards you can use the text selection tools in your PDF document and, for example, copy the text to the clipboard. Be aware that the program might not recognize all letters correctly (print quality, the chosen font and its style, dirt on the scanned paper, similar-looking letters and other factors can cause difficulties).
You can then paste the text from the clipboard anywhere you want (for example into a text processor) and edit or complete it.
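
The cropping step in the procedure above comes down to turning two dragged corners into a rectangle and keeping only what lies inside it. The following minimal sketch illustrates the idea on a page stored as rows of characters. It assumes coordinates measured from the top-left corner, and the names crop_box and crop are hypothetical; it does not describe how Adobe Acrobat implements its Crop tool.

```python
def crop_box(start, end, page_width, page_height):
    """Turn two drag corners into (left, top, right, bottom) inside the page."""
    (x0, y0), (x1, y1) = start, end
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    # Clamp to the page so a drag past the edge still yields a valid box.
    left, right = max(0, left), min(page_width, right)
    top, bottom = max(0, top), min(page_height, bottom)
    return left, top, right, bottom

def crop(page, box):
    """Cut the box out of a page stored as a list of equally long strings."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]

page = [
    "..........",
    "..Hello...",
    "..world...",
    "..........",
]
# Dragging from the bottom-right corner to the top-left one gives the same box.
box = crop_box((7, 3), (2, 1), page_width=10, page_height=4)
print(box)
print(crop(page, box))
```

Sorting the coordinates means it does not matter in which direction you drag the rectangle, which matches the way the first click can be any corner.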

Individual Task

Use this scanned document, which contains the propositions for a tournament. Use the OCR function to extract its text and create a similar document in Adobe InDesign.

You can use the following logos: logo-zlin, logo-olomouc.

Do not forget to correct all typographical mistakes in the final document and try to create a faithful copy. You should also check that the text recognized by the OCR function does not contain any mistakes; this method is not always fully reliable (you should find one mistake in this document).
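
A simple automated pass can point you to some likely recognition errors before you proofread the text yourself. The following minimal sketch assumes that a word mixing letters and digits is suspicious (for example 0 read instead of O, or 1 instead of l). The function name suspicious_words and the sample sentence are hypothetical, and such a check finds only one kind of error, so it cannot replace careful reading.

```python
import re

# A word is flagged when it contains at least one letter and at least one
# digit. [^\W\d_] matches any letter, including letters with diacritics.
MIXED = re.compile(r"\b(?=\w*[^\W\d_])(?=\w*\d)\w+\b")

def suspicious_words(text):
    """Return the words of the text that mix letters and digits."""
    return MIXED.findall(text)

sample = "The t0urnament starts at 10:30 in the sports ha11."
print(suspicious_words(sample))
```

Genuine codes such as room numbers that combine letters and digits would be flagged as well, so treat the result as a list of places to look at, not as a list of confirmed mistakes.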

Export the final document from Adobe InDesign to PDF; you can see a preview in the following image.

webdesign, xhtml, css, php - Mgr. Michal Mikláš