20. Optical Character Recognition (OCR)

Cases often contain images with human-readable text in them, e.g. web page screenshots. These images can be embedded in documents, e.g. a scanned or faxed document is packaged as a PDF containing TIFF images, or a chart is embedded as a picture in a Word document.

The technique for identifying the text in such images (embedded or not) is called Optical Character Recognition, commonly abbreviated to OCR. Applying OCR makes the textual contents of these images available for keyword search.

Some modern scanners already apply OCR techniques during scanning and add the extracted text to the PDF. If this is the case, Intella Connect will pick up the text automatically during indexing. Often this machine-accessible text is missing though, or it contains too many recognition errors to be useful for keyword searching. Also, loose images do not come with such text at all.

To overcome this, Intella Connect offers OCR support, letting you improve your case index.

20.1. Starting OCR

Intella’s OCR support is currently a separate step that follows indexing: the case admin either starts it manually after indexing has completed, or it runs as a post-processing task. In the future, we may make this part of the indexing process.

To OCR a collection of search results, you can use the following procedure:

  1. Use Ctrl-click or Shift-click to select multiple items in the Details pane, using the table, list or thumbnails view.
  2. Right-click and choose “Add Tags…”. Tag the items you wish to OCR with a new tag, e.g. ocr-1. You can skip this step if you wish to OCR items based on an existing tag.
  3. Open “Preferences” and navigate to “Background tasks”. Click the “Add new” button. This will open a dialog in which you can configure the new Background Task.
  4. In the left panel, locate the section labeled “OCR” and choose the appropriate OCR method (the methods are described in the next section).
  5. Regardless of the selected method, the first step is to pick the tag from step 2 in the “Select tag” dropdown located in the panel on the right. Intella Connect will use this tag to find the items to OCR.
  6. Set the remaining options for the selected method, as described in the sections below, and start the task.

The “OCR Candidates” task condition can be used in order to automate OCR. See the section Admin’s manual > Sources > Post-processing > Tasks for more detailed information on running OCR as a post-processing task.

You can also use the OCR button in the previewer to OCR the current item using the embedded ABBYY FineReader engine.

20.2. OCR methods

Intella currently supports three OCR methods:

  • ABBYY FineReader (embedded)

    This method OCRs the items using an engine embedded in Intella Connect. It is fully automatic and doesn’t require any additional software or licenses.

  • ABBYY Recognition Server

    This method sends the files to a Recognition Server for processing and automatically incorporates the returned results into the case. It is fully automatic and requires a licensed and configured instance of ABBYY Recognition Server that is available over the network. Make sure that your system administrator has properly set up the ABBYY Recognition Server configuration before using this feature.

  • External OCR tool

    This method consists of exporting the items as loose files, processing them with the user’s preferred OCR software, and importing the OCRed files back into the case.

20.3. Using ABBYY FineReader (embedded)

This method is fully automated and doesn’t require installing any additional software or licenses. It uses the ABBYY FineReader engine embedded in Intella Connect.

Steps to OCR selected items with ABBYY FineReader (embedded):

  • Specify the profile, which sets the balance between speed and quality:
    • Accuracy: OCRing may take longer, but produces higher-quality output.
    • Speed: OCRing may be faster, but produces lower-quality output.
  • Specify the languages that are used in the items. Note that adding more languages will make the process slower.
  • Specify the number of workers. It should match the number of logical CPU cores on your machine in order to achieve the best performance.
  • Specify the output format: Plain Text or PDF. If the PDF format is selected, Intella will store both the OCRed text and a searchable PDF version of the document.
  • Use the “Detect page orientation” option to automatically rotate an image if its orientation differs from normal.
  • Use the “Correct inverted images” option to detect whether an image is inverted (white text against black background).
  • Use the “Skip OCRed items” checkbox to skip items that have already been OCRed before. Otherwise, Intella Connect will replace any existing OCRed text.
  • Click the “OK” button to start the OCR process.
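If you are unsure how many logical CPU cores a machine has, the following minimal Python sketch reports the count. It is only an illustration and is not part of Intella Connect; the function name suggested_worker_count is hypothetical, and the script needs to be run on the machine that actually performs the OCR for the number to be meaningful.

```python
import os


def suggested_worker_count():
    """Return the number of logical CPU cores, or 1 if it cannot be determined."""
    # os.cpu_count() may return None on some platforms, hence the fallback.
    return os.cpu_count() or 1


print(suggested_worker_count())
```

The printed value can be used as a starting point for the number of workers; you may still want to lower it if the machine is shared with other workloads.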

20.4. Using ABBYY Recognition Server

When you have access to an ABBYY Recognition Server, you can utilize it to OCR selected items in the case fully automatically.

Note

ABBYY Recognition Server 3.5 or 4.0 should be used.

Steps to OCR selected items with ABBYY Recognition Server:

  • Confirm with your administrator that the ABBYY Recognition Server integration in Intella Connect has been properly configured.
  • Start creating a new Background Task with type “ABBYY Recognition Server”, as described above.
  • Optionally, you can skip the OCR process for items that have already been OCRed.
  • Click the “OK” button to start the OCR process.

The selected documents will now be sent to the Recognition Server. The results that it sends back will be processed automatically, similar to how the external OCR tool method works.

Please make sure that your ABBYY Recognition Server is configured correctly:

  • A separate document should be generated for each input file.
  • The output format should be a format that Intella can index.
  • The following parameters need to be set correctly in the following file (suggested parameters allow for processing files up to 30 MB): C:\Program Files (x86)\ABBYY Recognition Server 3.5\RecognitionWS\web.config

Parameters:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.web>
    <httpRuntime maxRequestLength="409600" />
  </system.web>
  <system.webServer>
    <security>
      <requestFiltering>
        <requestLimits maxAllowedContentLength="300000000" />
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
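The two limits above are expressed in different units, which is easy to overlook: in ASP.NET, maxRequestLength is given in kilobytes, while the IIS maxAllowedContentLength setting is given in bytes. The minimal Python sketch below reads both values from a copy of the configuration and converts them to megabytes (1 MB taken as 1024 × 1024 bytes). The function name request_limits_mb is hypothetical and the script is only an illustration, not part of Intella Connect or ABBYY Recognition Server. These are raw request sizes; the size of an individual file that can be processed may be lower, since a request can carry overhead beyond the file itself.

```python
import xml.etree.ElementTree as ET

# A copy of the configuration shown above, kept as bytes so that the
# XML encoding declaration is handled by the parser.
WEB_CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.web>
    <httpRuntime maxRequestLength="409600" />
  </system.web>
  <system.webServer>
    <security>
      <requestFiltering>
        <requestLimits maxAllowedContentLength="300000000" />
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
"""


def request_limits_mb(config_bytes):
    """Return (maxRequestLength, maxAllowedContentLength), both converted to MB."""
    root = ET.fromstring(config_bytes)
    runtime = next(root.iter("httpRuntime"))
    limits = next(root.iter("requestLimits"))
    max_request_kb = int(runtime.get("maxRequestLength"))            # kilobytes
    max_content_bytes = int(limits.get("maxAllowedContentLength"))   # bytes
    return max_request_kb / 1024, max_content_bytes / (1024 * 1024)


asp_net_mb, iis_mb = request_limits_mb(WEB_CONFIG)
print(f"maxRequestLength: {asp_net_mb:.1f} MB")
print(f"maxAllowedContentLength: {iis_mb:.1f} MB")
```

The smaller of the two values is the one that effectively caps the request size, so both settings need to be raised together.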

20.5. Using an external OCR tool

To OCR the selected items with an external OCR tool, you first need to create an export package (ZIP archive). Once you click the “Ok” button, Intella Connect will export the items in their original format to the ZIP package. Every file will be named after the MD5 hash of the item. Note that this means that unique items are only exported once! You can download the package from the Background Tasks list (a download link will be shown in the “Download” column once the relevant Background Task is completed).
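To illustrate why MD5-based naming causes duplicates to be exported only once, here is a minimal Python sketch. It is not Intella Connect’s actual export code: it assumes the hash is computed over the raw bytes of the file, and the function names md5_export_name and unique_export_names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_export_name(path):
    """Return '<md5 hex digest><original extension>' for a file (assumed naming scheme)."""
    path = Path(path)
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    return digest + path.suffix.lower()


def unique_export_names(paths):
    """Map each export name to the first file that produced it.

    Files with identical content and extension get the same name,
    so they collapse into a single entry.
    """
    names = {}
    for p in paths:
        names.setdefault(md5_export_name(p), Path(p))
    return names
```

Calling unique_export_names on a list that contains two byte-identical copies of a scan yields one entry for both, which mirrors the "exported once" behavior described above.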

Download and unzip the export package. Next you can use any OCR tool to process the exported files.

To import the OCRed files back to Intella Connect, the tool and its configuration should comply with the following requirements:

  • The OCR tool must be able to create a single OCRed file for each input file. Put these files in a separate folder.
  • The file name of the OCR output must match the original file name, but it may have a different file extension, per the file type produced by the OCR tool. For example, if the original file name is 6345b60187d08be573133376d7543c54.tif, then the OCRed file name can be 6345b60187d08be573133376d7543c54.txt.
  • The OCRed file format must be one of the formats supported by Intella Connect, e.g. plain text, PDF, MS Office, etc.
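Before importing, you may want to check the first two requirements yourself. The following minimal Python sketch compares the unzipped export folder with the folder holding the OCR output, matching files by name without extension. The function name check_ocr_outputs and the folder names are hypothetical, and the check is only an illustration; it does not verify that the output format is one Intella Connect supports.

```python
from pathlib import Path


def check_ocr_outputs(exported_dir, ocr_dir):
    """Compare file names (without extension) in the two folders.

    Returns (missing, unexpected, duplicated):
      missing    - exported names with no OCR output file
      unexpected - OCR output names that match no exported file
      duplicated - names with more than one OCR output file
    """
    exported = {p.stem for p in Path(exported_dir).iterdir() if p.is_file()}
    counts = {}
    for p in Path(ocr_dir).iterdir():
        if p.is_file():
            counts[p.stem] = counts.get(p.stem, 0) + 1
    produced = set(counts)
    missing = sorted(exported - produced)
    unexpected = sorted(produced - exported)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return missing, unexpected, duplicated
```

For example, check_ocr_outputs("export", "ocr_output") returns three empty lists when every exported file has exactly one matching OCR output file.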

Use the “Skip OCRed items” checkbox to skip items that have already been OCRed before. Uncheck “Skip OCRed items” to replace any existing OCRed text with the new text. The “Import as” option can be used to specify the format of the OCRed files; otherwise, Intella Connect will try to detect it automatically. Click the Import button to import the files.

After you have OCRed the files, compress all of them into a single ZIP archive and go back to the Background Tasks list. You will now have to create a second Background Task, this time using the “Import OCR package” option. Use the file upload box to drag and drop the package file, or press the “Select” button to open a file chooser. Click “Ok” to start importing the package.
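Any ZIP tool can be used to create the package. As one possibility, the minimal Python sketch below packs all files from the OCR output folder into a single archive. The function name zip_ocr_outputs and the paths are hypothetical, and placing the files at the top level of the archive (without subfolders) is an assumption; the required internal layout of the package is not specified here.

```python
import zipfile
from pathlib import Path


def zip_ocr_outputs(ocr_dir, zip_path):
    """Write every file in ocr_dir into one ZIP archive and return the stored names."""
    ocr_dir = Path(ocr_dir)
    zip_path = Path(zip_path)
    # Collect the files first, and leave out the archive itself in case
    # it is being written into the same folder.
    files = sorted(
        p for p in ocr_dir.iterdir()
        if p.is_file() and p.resolve() != zip_path.resolve()
    )
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            # arcname=p.name stores the file without its folder path (assumed layout).
            zf.write(p, arcname=p.name)
    return [p.name for p in files]
```

For example, zip_ocr_outputs("ocr_output", "ocr_package.zip") produces a package that can then be uploaded in the “Import OCR package” task.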

Intella Connect will analyze every file in the specified package, extract the text and link it to the original item and all its copies. The imported OCRed text can be found under a separate OCR tab in the previewer.

20.6. Reviewing OCRed items

To find all items in a case that have been OCRed, you can use the OCRed category in the Features facet. This attribute is also reflected in the Details table in the OCRed column. When an OCRed item is previewed, this will be shown as an additional property in the Properties tab.

When importing OCRed documents, Intella Connect will extract the text, add it to the index and store a searchable (original view) version of the document. The text can be found in the OCR tab of the previewer. The original view can be found in the OCR Preview tab. Note that the original content of the item will not be replaced. See the Exporting section for more details about exporting the OCRed text and the original view.

Note

When converting an old case created with Intella Connect 2.0.1 or older to the 2.1 format, the OCRed text will NOT be transferred to the OCR tab. It will appear under the Contents tab instead.