Cases often contain images with human-readable text in them, e.g. web page screenshots. These images can be embedded in documents, e.g. a scanned or faxed document is packaged as a PDF containing TIFF images, or a chart is embedded as a picture in a Word document.
The techniques for identifying the text in such images (embedded or not) are collectively called Optical Character Recognition, commonly abbreviated to OCR. Applying OCR makes the textual contents of these images available for keyword search.
Some modern scanners already apply OCR techniques during scanning and add the extracted text to the PDF. If this is the case, Intella Connect will pick up the text automatically during indexing. Often this machine-accessible text is missing though, or it contains too many recognition errors to be useful for keyword searching. Also, loose images do not come with such text at all.
To overcome this, Intella Connect offers OCR support, letting you improve your case index.
Intella’s OCR support is currently a post-processing step: the case admin either starts it manually after indexing has completed or runs it as a post-processing task. In the future, we may make this part of the indexing process.
To OCR a collection of search results, you can use the following procedure:
The “OCR Candidates” task condition can be used in order to automate OCR. See the section Admin’s manual > Sources > Post-processing > Tasks for more detailed information on running OCR as a post-processing task.
Intella currently supports three OCR methods:
ABBYY FineReader (embedded)
This method OCRs the items using an engine embedded in Intella Connect. The method is fully automatic and doesn’t require any additional software or licenses.
ABBYY Recognition Server
This method sends the files to a Recognition Server for processing and automatically incorporates the received results into the case. It is fully automatic, but requires a licensed and configured instance of ABBYY Recognition Server that is reachable over the network. Make sure that your system administrator has set up the ABBYY Recognition Server configuration properly before using this feature.
External OCR tool
This method consists of exporting the items as loose files, processing them with the user’s preferred OCR software, and importing the OCRed files back into the case.
This method is fully automated and doesn’t require installing any additional software or licenses. It uses the ABBYY FineReader engine embedded in Intella Connect.
Steps to OCR selected items with ABBYY FineReader (embedded):
- Specify the profile that sets the balance between speed and quality:
  - Accuracy. OCR may take longer, but produces higher-quality output.
  - Speed. OCR may be faster, but produces lower-quality output.
When you have access to an ABBYY Recognition Server, you can utilize it to OCR selected items in the case fully automatically.
Note: ABBYY Recognition Server 3.5 or 4.0 should be used.
Steps to OCR selected items with ABBYY Recognition Server:
The selected documents will now be sent to the Recognition Server. The results that it sends back will be processed automatically, similar to how the external method works.
Please make sure that your ABBYY Recognition Server is configured correctly:
Parameters:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.web>
    <httpRuntime maxRequestLength="409600" />
  </system.web>
  <system.webServer>
    <security>
      <requestFiltering>
        <requestLimits maxAllowedContentLength="300000000" />
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
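The two limits in this configuration use different units: in ASP.NET, maxRequestLength is expressed in kilobytes, while the IIS maxAllowedContentLength setting is expressed in bytes. A request has to pass both checks, so the smaller of the two is the effective upload limit. The following Python sketch is not part of Intella Connect or ABBYY Recognition Server; it only illustrates the arithmetic for the sample values above, and the function name upload_limits is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample configuration shown above, copied verbatim.
WEB_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.web>
    <httpRuntime maxRequestLength="409600" />
  </system.web>
  <system.webServer>
    <security>
      <requestFiltering>
        <requestLimits maxAllowedContentLength="300000000" />
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
"""


def upload_limits(config_xml: bytes) -> dict:
    """Return both request size limits in bytes, plus the effective one."""
    root = ET.fromstring(config_xml)
    runtime = next(root.iter("httpRuntime"))
    request_limits = next(root.iter("requestLimits"))
    # maxRequestLength is given in kilobytes.
    aspnet_bytes = int(runtime.get("maxRequestLength")) * 1024
    # maxAllowedContentLength is given in bytes.
    iis_bytes = int(request_limits.get("maxAllowedContentLength"))
    return {
        "aspnet_bytes": aspnet_bytes,
        "iis_bytes": iis_bytes,
        "effective_bytes": min(aspnet_bytes, iis_bytes),
    }


if __name__ == "__main__":
    for name, value in upload_limits(WEB_CONFIG.encode("utf-8")).items():
        print(f"{name}: {value} bytes ({value / 1024 / 1024:.1f} MiB)")
```

With the sample values, the ASP.NET limit works out to 419430400 bytes (400 MiB) and the IIS limit to 300000000 bytes (about 286 MiB), so files larger than roughly 286 MiB would be rejected by the web server.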
To OCR the selected items with an external OCR tool, you first need to create an export package (ZIP archive). Once you click the “Ok” button, Intella Connect will export the items in their original format to the ZIP package. Every file is named after the MD5 hash of the item. Note that this means each unique item is exported only once, even when the case contains multiple copies of it. You can download the package from the Background Tasks list (a download link is shown in the “Download” column once the relevant Background Task has completed).
Download and unzip the export package. Next you can use any OCR tool to process the exported files.
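Before running an OCR tool over the unzipped files, it can be useful to confirm that the file names still match their contents. The Python sketch below is an optional sanity check, not an Intella Connect feature. It assumes that the part of each file name before the first dot is the hexadecimal MD5 hash of the file's bytes, which is how the naming described above is read here; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_name_mismatches(export_dir):
    """List files whose base name differs from the MD5 of their content.

    Assumption: the text before the first dot in the file name is the
    item's MD5 hash in hexadecimal notation.
    """
    mismatches = []
    for path in sorted(Path(export_dir).iterdir()):
        if not path.is_file():
            continue
        base_name = path.name.split(".", 1)[0].lower()
        if base_name != md5_of_file(path):
            mismatches.append(path.name)
    return mismatches


if __name__ == "__main__":
    import sys

    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in find_name_mismatches(folder):
        print(f"name does not match content: {name}")
```

A file reported here was either renamed or modified after export. Note that an OCR tool that rewrites a file in place changes its content hash, so this check is only meaningful on the untouched export.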
To import the OCRed files back to Intella Connect, the tool and its configuration should comply with the following requirements:
Use the “Skip OCRed items” checkbox to skip items that have already been OCRed. Uncheck “Skip OCRed items” to replace any existing OCRed text with the new text. The “Import as” option can be used to specify the format of the OCRed files; otherwise Intella Connect will try to detect it automatically. Click the Import button to import the files.
After you have OCRed the files, ZIP all of them into a single ZIP archive and go back to the Background Tasks list. You will now have to create a second Background Task, this time using the “Import OCR package” option. Use the file upload box to drag and drop the package file, or press the “Select” button to open a file chooser. Click “Ok” to start importing the package.
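Any ZIP utility can be used to build the package. For batch work, the step can also be scripted; the following Python sketch places every file from one folder at the top level of a new archive. The flat layout and the function name build_ocr_package are assumptions made for this example. The sketch does not rename files or check them against the import requirements.

```python
import zipfile
from pathlib import Path


def build_ocr_package(ocr_dir, package_path):
    """Zip all files found directly in ocr_dir into package_path.

    File names are kept exactly as the OCR tool wrote them. Subfolders
    are ignored, and the archive itself is skipped in case it is created
    inside ocr_dir. Returns the names that were added.
    """
    ocr_dir = Path(ocr_dir)
    package_path = Path(package_path)
    # Collect the files first, so a package created in the same folder
    # is never added to itself.
    files = [
        path
        for path in sorted(ocr_dir.iterdir())
        if path.is_file() and path.resolve() != package_path.resolve()
    ]
    with zipfile.ZipFile(package_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
    return [path.name for path in files]


if __name__ == "__main__":
    import sys

    added = build_ocr_package(sys.argv[1], sys.argv[2])
    print(f"added {len(added)} files to {sys.argv[2]}")
```

The resulting archive can then be uploaded through the file upload box as described above.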
Intella Connect will analyze every file in the specified package, extract the text and link it to the original item and all its copies. The imported OCRed text can be found under a separate OCR tab in the previewer.
To find all items in a case that have been OCRed, you can use the OCRed category in the Features facet. This attribute is also reflected in the Details table in the OCRed column. When an OCRed item is previewed, this will be shown as an additional property in the Properties tab.
Note that when the OCR software enhances an existing PDF document by inserting the text in it, this text will be extracted and added to the index, but the binary item stored in the case is not replaced. This means that when exporting or previewing that item, you get the original PDF, not the OCR-enhanced PDF. This will be addressed in a future version.
Note: when converting an old case created with Intella Connect 2.0.1 or older to the 2.1 format, the OCRed text will NOT be transferred to the OCR tab. It will appear under the Contents tab instead.