Using OCR and SharePoint Metadata to Make Image Files Searchable

July 24, 2009


Many of our discussions with customers using SharePoint have been about making content searchable by using OCR (Optical Character Recognition) to convert image files to PDF. The converted PDF file looks like the original but also includes a text layer that SharePoint can index, so the document can be found using the search engine. Examples of image files include scanned invoices, image-only PDF files of research material, and legal documents. In many cases, the original image files are discarded after the rendition is created to save storage space, because the converted PDF file retains the look of the original as well as the recognized text.

This approach isn’t acceptable for all industries. For example, in the insurance industry the original files must be retained in the event of litigation, a common occurrence in that line of business. Since SharePoint does not support the concept of document renditions (alternate representations of a document), establishing a relationship between the original and the searchable PDF is a challenge. Insurers would also like to avoid storing multiple versions of files, to keep storage costs down. One company we spoke with had over 1 million documents in SharePoint and was required to keep them available for long periods of time, because claims typically involve large sums of money and take years to settle.

A solution for this business problem is to use our document conversion workflow to perform OCR on image files loaded into SharePoint and use the extracted text to supplement the original file instead of making a PDF rendition. The text extracted by the Express recognition engine is added to the original image file as SharePoint metadata. This makes it possible for users to easily find the file based on its (previously hidden) content.
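
To make the idea concrete, here is a minimal sketch in Python of the pattern described above: run OCR on the original image, leave the file itself untouched, and store the recognized text in a metadata field that search can match against. This is not our workflow's code or a SharePoint API. `LibraryItem`, `attach_ocr_text`, `find_items`, the `OcrText` field name, and the `fake_recognize` stand-in for an OCR engine are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LibraryItem:
    """Hypothetical stand-in for a file stored in a document library."""
    name: str
    content: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


def attach_ocr_text(item: LibraryItem,
                    recognize: Callable[[bytes], str],
                    field_name: str = "OcrText") -> LibraryItem:
    """Run OCR on the original image and store the text as metadata.

    The image bytes are not modified and no PDF rendition is created.
    """
    text = recognize(item.content)
    # Collapse runs of whitespace so the stored text is compact.
    item.metadata[field_name] = " ".join(text.split())
    return item


def find_items(items: List[LibraryItem], term: str,
               field_name: str = "OcrText") -> List[str]:
    """Naive case-insensitive search over the OCR metadata field."""
    needle = term.lower()
    return [i.name for i in items
            if needle in i.metadata.get(field_name, "").lower()]


def fake_recognize(image_bytes: bytes) -> str:
    # Placeholder for a real OCR engine call; returns canned text.
    return "Claim  No. 4471\nWater damage, kitchen"


scan = LibraryItem(name="claim-4471.tif", content=b"\x49\x49\x2a\x00")
attach_ocr_text(scan, fake_recognize)
print(scan.metadata["OcrText"])
print(find_items([scan], "water damage"))
```

The point of the design is that only one copy of the document is kept: the original image stays as the record of truth, and the text rides along with it as metadata rather than living in a second file that has to be linked back to the first.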
